[05:56:47] good morning
[05:59:36] The two full disk partitions are related to yarn logs, will follow up
[06:05:19] !log truncate logs for application_1615988861843_158592 on analytics1061 - one partition full
[06:05:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:08:25] RECOVERY - Disk space on Hadoop worker on analytics1061 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:10:50] !log kill application_1615988861843_158592 on analytics1061 to allow space to recover (truncate of course in D state)
[06:10:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:17:12] !log kill application application_1615988861843_158645 to free space on analytics1070
[06:17:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:17:36] I am working with Aiko on this, tensorflow on Yarn for some reason spams the yarn logs a lot
[06:19:55] RECOVERY - Disk space on Hadoop worker on analytics1070 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:06:22] Analytics-Radar, WMDE-TechWish-Sprint-2021-03-31: Reportupdater output can be corrupted by hive logging - https://phabricator.wikimedia.org/T275757 (awight) Open→Resolved a: awight
[08:50:21] * elukey hates hue
[08:50:35] the new 4.9 version doesn't work in test, sigh
[08:50:41] the hive panel is broken
[09:02:47] anything obvious?
[09:09:13] good morning :)
[09:10:02] I am trying to revert some commits and rebuild, I suspect it is due to some changes to thrift, but upstream is not super collaborative
[09:10:24] https://github.com/cloudera/hue/issues/1997
[09:10:56] their release model is break first, patch later
[09:11:38] anyway, I'd like to upgrade hue behind hue-next.wikimedia.org since we are going to switch hue.wikimedia.org to it soon
[09:11:55] hue next uses the Hue github repo
[09:12:02] (self packaged by us)
[09:12:13] meanwhile hue.wikimedia.org uses the CDH 5 Hue
[09:12:24] is it always HEAD or does it go by tags?
[09:12:39] there are tags yes
[09:12:41] https://www.cloudera.com/downloads/paywall-expansion.html
[09:12:50] this is the main reason to drop CDH 5 Hue
[09:13:01] I'd like to clean up our repositories just in case
[09:13:09] Ah.
[09:13:20] yeah lovely I know :(
[09:13:54] (on the bright side, we switched to Bigtop just in time)
[09:14:36] So if Hue goes subscription-only, what will it be replaced with?
[09:15:25] in theory it is only their packages etc., the github repo's code is licensed under Apache 2.0
[09:15:49] and I really hope to drop Hue when airflow comes in, but not sure if we'll make it
[09:16:37] There is also https://phabricator.wikimedia.org/T264896 with the upstream bugs to fix
[09:16:55] (upstream is basically receptive to pull requests only)
[09:17:36] so the best thing to do is probably to assign somebody on the team to fix these, but it may take a while and not sure what the priority will be (low to lowest probably)
[09:50:23] I tried to roll back to 4.8, the prev version (in test), and got the same error, now I am confused
[10:10:51] Did you try the page before starting the update? Maybe it has always been broken that way?
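A hedged sketch of the disk-space cleanup being !logged above: the application id is the one from the alert, but the container-log directory is an assumption (it depends on yarn.nodemanager.log-dirs on the worker), and the user/keytab needed for the kill will vary by setup.

```bash
# Sketch only: free disk space eaten by a runaway Yarn application's container logs.
APP_ID="application_1615988861843_158592"
LOG_DIR="/var/lib/hadoop/data/b/yarn/logs"   # hypothetical yarn.nodemanager.log-dirs entry

# How much space are this application's container logs taking?
du -sh "${LOG_DIR}/${APP_ID}"

# Truncate the big log files in place. If the writer is stuck in D state (as above),
# the truncate itself can hang and the space may not come back until the app is gone.
find "${LOG_DIR}/${APP_ID}" -type f -size +1G -exec truncate -s 0 {} +

# Last resort: kill the offending Yarn application outright.
yarn application -kill "${APP_ID}"
```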
[10:13:32] It used to work, I didn't test it recently, and we have the same version running on an-tool1009 (the hue-next.wikimedia.org backend) that works fine
[10:14:23] the test cluster might have something special, but the errors are really cryptic
[10:18:13] `Exception in thread "IPC Client (1969564802) connection to an-master1001.eqiad.wmnet/10.64.5.26:8032 from klausman@WIKIMEDIA" java.lang.OutOfMemoryError: GC overhead limit exceeded` <-- oops.
[10:20:27] :)
[10:22:41] ok I think that I found the issue
[10:23:22] I have upgraded an-test-coord1001 to buster recently, as a prep-step/test for an-coord1001
[10:23:31] !log deploying aqs with updated cassandra libraries to aqs1004 while depooled
[10:23:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:23:40] part of the changes involve moving hive to org.mariadb.jdbc.Driver
[10:23:50] since the mysql one is not available anymore
[10:24:34] I suspect that Hue doesn't like it
[10:24:57] I pointed Hue test to an-coord1001 and all worked as expected
[10:25:20] this is a big problem :(
[10:25:41] isn't MariaDB the same as Mysql on the wire?
[10:26:01] hnowlan: little nit for the future - if possible add a deploy msg to scap like "testing only on aqs1004 blabla"
[10:26:41] elukey: good call, always realise that too late with scap :/
[10:27:21] hnowlan: yes yes, no big deal, it happens the same to me :)
[10:27:59] klausman: here be dragons, I think yes but they diverged in some things, maybe there is a specific query that mariadb doesn't digest
[10:28:27] :-S
[10:28:39] Does Hue at least log the failed query somewhere?
[10:30:33] yes yes, but I am 100% sure that upstream will not try to solve the problem since they support mysql
[10:31:33] Could you try running the query manually in a mariadb console?
[10:31:54] Could you try running the query manually in a mariadb console?
[10:32:03] oops, wrong window for up-arrow and return
[10:32:22] rolled back aqs on aqs1004, turns out there's some kind of problem with schemas the library has
[10:33:09] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.04.13?id=P5LEyngBA6MeBtBqgyBE
[10:37:11] hnowlan: good outcome of the test then!
[10:37:34] klausman: this is a good brainbounce, I thought that the query was for the hive metastore, but it may be for mariadb, will try!
[10:39:07] but then, it wouldn't be hue making it
[10:39:23] because hue talks only to the metastore
[10:39:27] mmmmm
[10:40:36] fg
[10:40:39] gah :)
[10:41:33] I am wondering how many other things will break if we upgrade an-coord1001
[10:41:50] I have an-coord1002 already upgraded, at least we could failover to it and test
[10:42:04] Hue seems the only thing complaining in hadoop test
[10:43:01] Yeah, just pointing things at 1002 and seeing where the smoke comes out might be a good test balloon. With advance warning etc.
[10:59:24] bbiab, lunch and groceries.
[11:08:33] lunch for me too
[11:43:22] Analytics: Top read repeats - https://phabricator.wikimedia.org/T280011 (dr0ptp4kt)
[11:52:07] good morning dear team
[12:25:12] hola!
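Since the suspicion above is the metastore's JDBC driver rather than Hue itself, here is a minimal sketch of how one might verify the driver config and try the failing query by hand, as suggested at 10:31. The hive-site.xml path and the metastore database name are assumptions, not verified against the real hosts.

```bash
# Sketch only: check which JDBC driver the Hive metastore is configured with after
# the buster upgrade, and whether the matching jar is actually installed.
grep -A1 'javax.jdo.option.ConnectionDriverName' /etc/hive/conf/hive-site.xml
# expected to show org.mariadb.jdbc.Driver after the change described above

# On buster the jar comes from libmariadb-java rather than libmysql-java
dpkg -L libmariadb-java | grep '\.jar$'

# Reproduce outside of Hue/Hive: run the query the metastore logs as failing directly
# in a mariadb console against the metastore database (name "hive_metastore" assumed).
sudo mysql hive_metastore -e "SHOW TABLES;"
```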
[12:44:41] Analytics, Event-Platform: Deploy schema repos to analytics cluster and use local uris for analytics jobs - https://phabricator.wikimedia.org/T280017 (Ottomata)
[13:04:08] Analytics, SRE, Traffic: Add Traffic's notion of "from public cloud" to Analytics webrequest data - https://phabricator.wikimedia.org/T279380 (fgiunchedi) p: Triage→Medium
[13:05:53] a-team FYI I will be merging some patches that change around Refine and related monitoring jobs today. Hopefully it won't trigger any false alerts but I can't promise! :)
[13:09:55] ottomata: would you have a few minutes to talk about ATS/Varnish-Kafka?
[13:11:56] mornin everyone
[13:14:27] ack!
[13:14:30] morning milimetric
[13:15:20] hiya elukey. I am having so much empathetic pain over your pain with Hue
[13:16:14] klausman: ya gimme a few, just merging something, then for sure
[13:17:31] sure, I'll be in the BC
[13:18:39] !log Refine now uses refinery-job 0.1.4; RefineFailuresChecker has been removed and its function rolled into RefineMonitor -
[13:18:39] T273789
[13:18:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:18:41] T273789: Sanitize and ingest all event tables into the event_sanitized database - https://phabricator.wikimedia.org/T273789
[13:19:03] milimetric: in this case it is a bigger mess, it seems to be the JDBC mariadb driver between hive server and metastore :(
[13:19:06] but only Hue complains
[13:19:42] yeah, maybe Hue's the only sane one in this case :P But nevertheless, I just wish it didn't exist
[13:20:12] as soon as we're done with Gobblin I'm going to push the Airflow migration as fast as I can
[13:24:02] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:24:36] me ^ fixing
[13:25:34] milimetric: I think Airflow is part of the problem, a lot of people use the hive editor via Hue so we'll have to empower Superset too :(
[13:25:59] i think once we get rid of oozie we can encourage people to stop using hue
[13:29:23] yes I agree, but we need to get Presto into better shape too
[13:29:45] otherwise we'll have to drop some use cases (like people using the hive editor in hue)
[13:30:00] I am not defending hue of course, just bringing up some points :)
[13:30:21] yeah, i think that would be good, but this might be a prime example of 'dropping systems' that we don't strictly need
[13:30:32] some people will lose out on some functionality, but we have too much stuff to support, ya know?
[13:30:40] PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:30:48] checking ^
[13:31:00] anyway hopefully we'd have a manager to make that decision :p
[13:31:39] the main problem right now is that if I upgrade an-coord1001 to buster I break the Hue hive editor
[13:32:03] oof
[13:32:29] elukey: ... wait why? sorry maybe i missed the thread...
[13:32:46] klausman: almost with ya, just making sure refine still works!
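For reference, a hedged sketch of how the flapping refine_* units above might be inspected and cleared; the unit and host names come from the alert text (an-launcher1002), everything else is generic systemd usage.

```bash
# Sketch only: chase down a CRITICAL "Check unit status of refine_event" alert.
systemctl status refine_event              # last run's exit status and timestamps
journalctl -u refine_event --since today   # the actual error output / stack trace

# If the failure was transient (e.g. a race with the refinery-job deploy mentioned above),
# clearing the failed state lets the Icinga check recover before the next scheduled run:
sudo systemctl reset-failed refine_event
# ...or re-run the job right away:
sudo systemctl start refine_event
```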
[13:33:07] np, I'm not in a particular rush :)
[13:34:40] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:35:29] ottomata: on buster we don't have libmysql-java anymore, which brings com.mysql.jdbc.Driver. It is replaced by libmariadb-java, which provides org.mariadb.jdbc.Driver (that needs to be set also in the hive-site.xml)
[13:35:59] elukey: if we need a jar, we can probably get and deploy it via maven
[13:36:54] ottomata: sure, but it was dropped by debian and it would be another special use case
[13:37:05] I am still not sure why this fails
[13:37:22] I suspect that libmariadb-dev's jar may not be fully compatible
[13:37:42] I am wondering if the debian 11/sid versions might be better
[13:38:36] PROBLEM - Check unit status of refine_mediawiki_job_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:41:34] RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:19:08] Analytics: Easy dimensional data visualization - https://phabricator.wikimedia.org/T280029 (Milimetric)
[14:19:42] RECOVERY - Check unit status of refine_mediawiki_job_events on an-launcher1002 is OK: OK: Status of the systemd unit refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:24:33] elukey: i wonder if relying on debian for language dependencies for things like java and python is just not the right thing to do anyway
[14:26:09] ottomata: for base libs it makes sense in my opinion
[14:26:37] the more we customize, the more we diverge from the debian security policies
[14:27:18] right, but if we were going to k8s, which we probably will for services like hue and hive-server, the deps would be in the docker image anyway
[14:27:39] it's a bit easier to manage the deps with the language package manager than with debian
[14:27:51] esp when debian upstream is not the one creating the service debian packages
[14:28:16] yes, but it is also a bigger problem when we need to track dependencies
[14:28:32] for example, a CVE is out and we need to figure out what to patch
[14:28:44] for stuff like python deps I would highly recommend using the pipeline blubber stuff if possible, it handles installing deps and containerising them pretty consistently
[14:29:18] yes, as long as we use something approved by SRE (and controlled) I am +1
[14:29:32] hnowlan: aye, would be good to get java in there nicely too
[14:29:47] although, hnowlan, python application deps are not really managed by SRE, right?
[14:29:56] they are just part of the image build process
[14:30:10] no, they're generally managed by dependent teams and/or whoever is building the service
[14:30:17] yeah I dunno if there's any java support yet
[14:30:17] right, i think that's going to be the general trend
[14:30:25] makes sense imo
[14:30:26] for java as well
[14:30:42] but elukey is right, it means that handling CVEs will be the job of the service-owning teams, not SRE
[14:31:42] and for extra libs etc. it makes complete sense, but it is surely easier for interpreters/std-libs to be managed via deb upstream
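A small sketch of the "get and deploy it via maven" idea floated at 13:35:59, assuming the MariaDB Connector/J artifact on Maven Central; the version number is only an example, not a recommendation.

```bash
# Sketch only: pull the MariaDB connector jar from Maven Central instead of relying
# on the debian package.
mvn dependency:get -Dartifact=org.mariadb.jdbc:mariadb-java-client:2.7.2

# The jar lands in the local maven repository; from there it could be published to
# Archiva or shipped alongside the service, at the cost of tracking CVEs ourselves.
ls ~/.m2/repository/org/mariadb/jdbc/mariadb-java-client/2.7.2/
```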
[14:31:43] yeah, and assumes that teams have the capability to do that... and that they have an awareness. it's a lot of variables, but it's also the path we started to go down when the pipeline was made a policy for services. that's not a useful answer but the ship's kinda sailing
[14:32:01] coming at this with little context but - if we were to get away from using system packages for java stuff at all, what would be the tool used for installing the dependencies?
[14:32:14] hnowlan: probably maven
[14:32:24] if in an image, that would just be part of the build
[14:32:37] if like we do now... i guess scap + archive + git-fat :/
[14:32:42] archiva*
[14:32:56] cool - in theory the maven step could just be part of the build step within the pipeline, doesn't necessarily *need* native support
[14:33:01] right
[14:33:11] elukey: but what is a 'std lib'? mysql client?
[14:33:12] not sure
[14:33:57] no, well, my thought was generic, not related to mysql
[14:34:00] services like hue/hive are built against specific versions of deps, if debian changes out the version underneath them, then breakages like this are probably likely
[14:34:25] yeah, i think for the build pipeline SRE/RelEng maintains certain base images
[14:34:32] like for specific versions of nodejs
[14:34:36] this goes deeper, namely that Debian chose to keep only mariadb-related libraries
[14:34:58] right, right, which hive/hue may not know about? since they depend on mysql libs?
[14:36:33] they offer both mysql and mariadb connector configs, but in this case there may be a bug
[14:36:40] aye
[14:39:00] elukey: guessing you are working on this on an-test-coord1001?
[14:39:08] i just ran puppet there and saw Notice: /Stage[main]/Bigtop::Hive::Server/Service[hive-server2]/ensure: ensure changed 'stopped' to 'running' (corrective)
[14:39:21] yes yes, I can stop now, please test anything
[14:39:56] I hoped to test the bullseye libmariadb-mysql package but hive doesn't pick it up in the classpath
[14:40:40] anyway, I think that for the moment we can move hue-next -> hue anyway
[14:41:01] the cloudera paywall is a little scary since we have their packages in our public apt repo
[14:45:14] (PS1) Hnowlan: Revert "package: bump restbase-mod-table-cassandra" [analytics/aqs] - https://gerrit.wikimedia.org/r/678859
[14:48:12] Analytics, Patch-For-Review: Fix the remaining bugs open for Hue next - https://phabricator.wikimedia.org/T264896 (elukey) https://github.com/cloudera/hue/issues/1997
[14:49:44] fdans: hola! Was https://gerrit.wikimedia.org/r/c/analytics/refinery/+/658348/ deployed?
[14:49:56] it is the only thing that I can see in the kanban
[14:50:01] the train is empty
[14:50:22] elukey: yes, it is deployed, thank you!
[14:50:31] not yet in done because backfilling
[14:51:46] perfect
[14:52:04] I am going to send an email to the team about the empty train, in case something comes up I can do it tomorrow
[15:15:56] elukey: i just noticed in the hadoop::directory define that if ensure is absent
[15:15:57] https://github.com/wikimedia/puppet/blob/production/modules/bigtop/manifests/hadoop/directory.pp#L53
[15:16:08] hadoop-hdfs-namenode is required
[15:16:19] but, that define can be used anywhere, including nodes that don't run a namenode
[15:16:24] i should remove the require, right?
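Relatedly, a sketch of how one might check whether a newer connector jar is actually visible to Hive, re the classpath complaint at 14:39:56. The /usr/share/java and /usr/lib/hive/lib paths are assumptions based on the usual Debian and Bigtop layouts, not verified on these hosts.

```bash
# Sketch only: Hive loads jars from its own lib directory (or HIVE_AUX_JARS_PATH),
# so a connector installed elsewhere on the host is invisible to it.
ls -l /usr/share/java/mariadb-java-client*.jar       # where libmariadb-java puts the jar
ls -l /usr/lib/hive/lib/ | grep -iE 'mysql|mariadb'  # what hive-server2/metastore can see

# One way to expose a manually fetched jar to Hive:
sudo ln -sf /usr/share/java/mariadb-java-client.jar /usr/lib/hive/lib/
sudo systemctl restart hive-server2 hive-metastore
```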
[15:17:06] OPHHHH but it can only be used in places that have the hdfs user
[15:17:06] hm
[15:18:00] oh no, that is everywhere
[15:18:07] ah, we just can't by default sudo to hdfs, uh huh
[15:18:10] ok yeah, that should be fine then
[15:19:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/678870
[15:30:59] ottomata: the camus timestamp extractor thing was able to look at multiple fields to extract the timestamp. That's still needed, right, to handle schema migration?
[15:31:28] ottomata: in theory yes, I think that the original reasoning was to avoid issuing directory creation requests if the namenode was down
[15:34:56] milimetric: hm. yes i think that's right
[15:35:12] k, cool, will do
[15:35:14] thx
[15:35:23] i think it's kind of a useful feature too, and probably not hard to implement... if you are extracting the time from the payload anyway
[15:35:44] i don't know if i would support fallback for multiple timestamp sources, e.g. kafka and then event data... or... hmm, i guess that would be cool
[15:35:47] but maybe that's difficult?
[15:35:57] elukey: aye ok
[15:48:25] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:01:16] a-team standup!
[16:04:28] Analytics-Radar, Growth-Scaling, Growth-Team (Current Sprint), Patch-For-Review, Product-Analytics (Kanban): Growth: update welcome survey aggregation schedule - https://phabricator.wikimedia.org/T275172 (nettrom_WMF) This work is 90% done. The notebook is updated and ready for automatic mont...
[16:16:03] bearloga: o/ ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/678864?
[16:16:19] elukey: yup!
[16:17:22] !log rebalance kafka partitions for webrequest_text partitions 19, 20
[16:17:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:18:09] elukey: thank you! I've got stat1007:/srv/discovery/venv all ready and all of reportupdater's dependencies installed there, so theoretically next time it runs it should backfill all the missing reports since feb 8
[16:18:50] elukey: please let me know if you continue to see errors from the systemd timer
[16:19:35] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:21:27] bearloga: yeah, I think we need to set /srv/discovery/venv/lib/python3.7/site-packages as PYTHONPATH
[16:21:37] without the " " quoting
[16:21:46] oh good, just commented ^
[16:21:59] silly me, I should have checked
[16:22:15] yes yes my bad
[16:22:16] ottomata elukey: ah! thank you, my bad
[16:22:57] perhaps golden main.sh should just use your venv's python though?
[16:23:17] bearloga: elukey ^?
[16:23:52] yes, it could be an option as well
[16:24:15] having the PYTHONPATH set seemed good but we can do anything
[16:24:16] ottomata: I don't want it dependent on my env
[16:24:47] ...? isn't it already?
[16:25:24] so /srv/discovery/venv/lib/python3.7/site-packages works fine, the timer is running now
[16:25:53] ottomata: it was, yes, but I want that job to be self-contained going forward
[16:26:05] milimetric: no rush, but two things:
[16:26:06] 1) could you take a look at my latest comment on https://phabricator.wikimedia.org/T261681 and leave your thoughts?
2) In the AQS code here https://github.com/wikimedia/analytics-aqs/blob/master/sys/pageviews.js#L198-L207, every time the per-article endpoint serves a request, it has to redefine the highlighted functions. From my understanding this is not the best practice - is there something I'm missing or is it just that the performance difference doesn't matter too much?
[16:27:09] elukey: wait, how? doesn't it need a separate patch to set pythonpath correctly?
[16:27:35] elukey: or did you just test it manually first?
[16:28:49] checked manually yes, I filed the patch to fix :)
[16:28:55] thanks so much!
[16:29:52] should have reviewed the other one more carefully :)
[16:33:03] bearloga: but... what's the difference between using the venv's python and adding its site-packages to PYTHONPATH in this case?
[16:33:19] oh, you just don't want that link in the script?
[16:33:46] RECOVERY - Check unit status of wikimedia-discovery-golden on stat1007 is OK: OK: Status of the systemd unit wikimedia-discovery-golden https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:33:51] goood
[16:36:29] ottomata: I suppose I could have updated main.sh to call that venv's python, sure. the puppet pythonpath solution just... felt better
[16:37:13] aye
[17:03:28] (PS1) GoranSMilovanovic: minor20210413 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/678902
[17:03:44] (CR) GoranSMilovanovic: [V: +2 C: +2] minor20210413 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/678902 (owner: GoranSMilovanovic)
[17:38:03] * elukey afk! ttl
[18:57:39] (PS1) Ottomata: ProduceCanaryEvents - include httpRequest body in failure message [analytics/refinery/source] - https://gerrit.wikimedia.org/r/678919 (https://phabricator.wikimedia.org/T274951)
[19:35:14] * razzi afk for lunch
[19:50:12] razzi: lemme know if you wanna sync, happy to join :)
[19:50:41] (i know there haven't been a lot of work days since our last sync :) )
[20:01:47] ottomata: yeah we can skip for today
[20:01:59] k!
[21:09:28] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:19:30] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:42:44] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:47:57] ^-- Looks like canary events fixed themselves again, but breaking twice in an hour is concerning
[21:52:44] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:17:23] Analytics, Analytics-Kanban, Patch-For-Review: Add better monitoring for Analytics UIs - https://phabricator.wikimedia.org/T277729 (razzi) I implemented a check that works on both staging and production, using the appropriate header for production (x-cas-uid rather than x-remote-user). I re-enabled a...
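To close the loop on the 16:33 question about PYTHONPATH versus the venv's python, a hedged sketch of the two approaches: "main.sh" and the venv path come from the conversation above, while the report script name is purely hypothetical.

```bash
# Option 1 (roughly what the puppet patch discussed above does):
# point the system python at the venv's installed packages.
PYTHONPATH=/srv/discovery/venv/lib/python3.7/site-packages python3 generate_reports.py

# Option 2 (what was suggested for main.sh):
# call the venv's own interpreter directly.
/srv/discovery/venv/bin/python generate_reports.py

# Functionally similar here; option 2 also pins the interpreter version the venv was
# built with, while option 1 keeps using whatever /usr/bin/python3 happens to be.
```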