[07:12:43] Analytics-Clusters, Operations, ops-eqiad, Patch-For-Review, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) Reporting a conversation with Chris over email about when to do the maintenance: > The 4th works for me at 1130EST If pos...
[07:26:17] !log re-run cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat failed hour via hue
[07:26:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:26:26] err failed day
[07:26:36] so it seems again due to timeouts to aqs1004-6
[07:37:41] and again MutationStage
[07:38:06] the only metric that I see misbehaving a lot is the cpu usage/load, that is way worse on 1004-1006
[07:38:41] the number of threads of the stage is dictated by the concurrent_writes option (not by num processor as others) and it is 32
[07:39:37] maybe we could think about lowering it down for 1004-1006
[07:40:32] * elukey brb
[08:43:08] Analytics-Clusters, DC-Ops, Operations, ops-eqiad: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) Thanks a lot for the insights! So few comments: * As far of current hadoop workers/rows distribution, we have: 13 in A, 14 in B, 19 in C, 1...
[08:43:24] lovely --^
[08:43:34] we don't have enough 10g space for our hadoop beefy nodes :D
[10:13:20] (PS1) Elukey: Update hadoop-test's test client [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/638444
[10:13:47] (CR) Elukey: [V: +2 C: +2] Update hadoop-test's test client [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/638444 (owner: Elukey)
[10:38:07] joal: bonjour! later on today I'll have to shutdown an-coord1001 for ram expansion, so I'll drain the cluster at around 15:00 CET
[10:41:07] Bonjour elukey :)
[10:41:24] No problem I'll be around for an extra pair of eyes if needed :)
[10:41:37] (kids at 16:00, but here before)
[10:43:25] ack thanks :)
[10:43:31] going afk for a bit for groceries!
[11:24:20] joal: I am querying brand new webrequest data just refined in the test cluster :D
[11:24:25] \o/
[11:24:49] \o/ indeed!
[11:24:54] Happy me :)
[11:25:13] thank you thank you
[11:26:00] good news is that the DNS CNAME for hive also works
[11:26:12] I'll test the failover later on to see if TTLs are respected
[11:26:28] This is GREAT!
[11:26:44] if so, we can schedule the big oozie restart :D
[11:27:00] indeed
[11:27:12] and then transitioning to an-coord1002 should be a matter of DNS
[11:28:10] (it also works with beeline, just tested)
[11:41:03] * elukey lunch!
[12:50:45] Analytics-Radar, Better Use Of Data, Product-Analytics, Product-Infrastructure-Data, and 4 others: Session Length Metric. Web implementation - https://phabricator.wikimedia.org/T248987 (Gilles)
[13:02:17] !log force a restart of performance-asoranking.service on stat1007 after fix for pandas' sort() - T266985
[13:02:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:02:21] T266985: asoranking failed its monthly run stat1007 - https://phabricator.wikimedia.org/T266985
[13:10:01] so I have tested on an-test-client1001 a hive query with hive, beeline, oozie (webrequest_load) and spark (pyspark and spark2-shell)
[13:10:04] all good
[13:10:15] so I am going to test the failover via https://gerrit.wikimedia.org/r/c/operations/dns/+/638502/
[13:10:45] we don't have anything on an-coord1002, so the idea is that I'll see failures in oozie etc..
[13:10:52] and also recoveries when I flip back
[13:12:46] ok done
[13:12:49] let's see how it goes
[13:21:21] java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://analytics-test-hive.eqiad.wmnet:10000/default: java.net.ConnectException: Connection timed out (Connection timed out)
[13:21:25] \o/
[13:21:28] this is oozie
[13:23:25] scala> spark.sql("select * from wmf.webrequest where year=2020 and month=10 and day=21 and hour=2 limit 10").show(10)
[13:23:28] 20/11/03 13:22:41 WARN metastore: Failed to connect to the MetaStore Server..
[13:23:52] yep it seems that we are good
[13:23:58] everything now fails as expected
[13:24:31] * elukey dances
[13:24:46] * joal dances with elukey :)
[14:02:51] !log stop timers on an-launcher1002 to drain the cluster (an-coord1001 maintenance prep-step)
[14:02:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:23:01] (CR) Joal: "Tested on webrequest. Gain is relatively small but exists." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/638021 (https://phabricator.wikimedia.org/T267009) (owner: Joal)
[14:44:24] of course the only thing that doesn't seem to work is... Hue
[14:50:29] I am also packaging Hue 4.8.0, let's see if anything is better
[15:15:22] cluster seems drained
[15:36:46] (CR) Mforns: [C: +1] "LGTM! I vaguely recall having to downgrade a library to make it compatible with our code, but the only one changed in this patch that is n" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/638040 (owner: Joal)
[15:39:27] ok an-coord1001 is down, we'll see
[15:39:59] the main issue is that the shutdown was not clean for mariadb (due to a miscommunication with dcops), so I really hope that mariadb comes up after the boot
[15:41:38] Analytics, Analytics-Kanban, Analytics-Wikistats, I18n: [[Wikimedia:Wikistats-metrics-top-mediarequests-name/jam]] translation issue - https://phabricator.wikimedia.org/T266669 (Dentonius) Thanks for replying @fdans . It appears that it was just a test question for me to be admitted as a transla...
[15:49:32] ah there was an oozie shared lib install when puppet ran
[15:49:38] hopefully it should be ok
[15:51:27] Analytics, Analytics-Kanban, Analytics-Wikistats, I18n: [[Wikimedia:Wikistats-metrics-top-mediarequests-name/jam]] translation issue - https://phabricator.wikimedia.org/T266669 (fdans) Open→Resolved
[15:51:40] seems ok, just came up
[15:52:10] !log re-enable timers after maintenance
[15:52:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:57:18] joal: hi! :] does hive compiler optimize repeated expressions?
[15:58:04] mforns: I think you need the joal-compiler for that
[15:58:13] hehehe
[15:58:15] :D
[15:59:11] elukey: joal-compiler definitely optimizes, actually, it's better than chuck-norris-compiler
[15:59:27] mforns: I completely agree
[15:59:32] xD
[16:01:13] (CR) Mforns: Improve webrequest-refine query shuffle stage (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/638086 (https://phabricator.wikimedia.org/T267008) (owner: Joal)
[16:10:51] (CR) Mforns: Add caching to maxmind readers in core package (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/638021 (https://phabricator.wikimedia.org/T267009) (owner: Joal)
[16:14:40] I also swapped with Chris the NIC on kafka-jumbo1006, now it is running 10g, all good
[16:19:01] Analytics-Clusters, Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (elukey) Tested in labs the CNAME failover to something that is not running any daemon at the moment (an-coord1002) and I got in the expected times failures for connection time...
[16:21:10] Analytics-Clusters, Operations, ops-eqiad, Patch-For-Review, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (elukey) Today I swapped the NIC on kafka-jumbo1006 with Chris and there was no need for /etc/network/interfaces changes, `firmware-...
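Note on the CNAME failover test (T257412): the check done by hand with hive, beeline, oozie and spark comes down to whether a TCP connection to the Hive endpoint succeeds before and after the DNS flip. A minimal sketch of that probe, using only the Python standard library; the helper name `can_connect` is hypothetical and this is not the tooling used in the log.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name (following any CNAME) and connects.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers DNS failures, refused connections and timeouts.
        return False
```

From a host inside the network, `can_connect("analytics-test-hive.eqiad.wmnet", 10000)` (the host and port from the JDBC URI logged at 13:21:21) would be expected to return False while the CNAME points at a host with no daemon running, and True again after flipping back, once resolver caches honour the record's TTL.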
[16:52:41] !log mv /srv/analytics.wikimedia.org/published/datasets/archive/public-datasets to /srv/backup/public-datasets on thorium - T265971
[16:52:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:52:46] T265971: Check data currently stored on thorium and drop what it is not needed anymore - https://phabricator.wikimedia.org/T265971
[16:54:41] Analytics: Check data currently stored on thorium and drop what it is not needed anymore - https://phabricator.wikimedia.org/T265971 (elukey) This is the current status: ` root@thorium:/srv# du -hs * | sort -h 4.0K log 8.0K deployment 16K lost+found 85M src 3.5G org.wikimedia.community-anal...
[17:06:38] Anyone know if Grafana has an API?
[17:06:59] Basically I want this (https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&refresh=1m&from=now-6h&to=now-1m) in machine-readable form
[17:07:04] chrisalbon: to get graphs or data?
[17:07:11] data
[17:07:25] (that was a good question)
[17:07:32] chrisalbon: grafana actually gets its data from APIs
[17:07:41] grafana is about graphs only
[17:07:58] I didn't know if it did the summarizing (counts etc) too
[17:08:09] chrisalbon: good question!
[17:08:17] ha
[17:08:19] chrisalbon: I think the data-serving systems do it
[17:08:24] but I'm not sure
[17:08:30] ah cool, thanks, I can dig in
[17:09:42] the metrics is in prometheus, so in theory you can query its API to get something like sum(ores_uwsgi_busy_workers_count)
[17:09:52] the metrics *are
[17:10:17] and we have a prometheus lib in pywmflib
[17:10:23] chrisalbon: --^
[17:10:29] (if you are using python)
[17:10:39] oh sweet thanks
[17:11:17] chrisalbon: in case of question, always mention elukey - He's a big part of our knowledge base :)
[17:11:28] lol
[17:11:30] cool
[17:14:03] Does pywmflib have docs?
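Note on querying Prometheus directly: the suggestion above maps onto an instant query against the standard Prometheus HTTP API (`/api/v1/query`). A minimal sketch using only the Python standard library; `PROMETHEUS_URL` is a placeholder because the log does not give the real endpoint, and the helper names are hypothetical rather than part of pywmflib or spicerack.

```python
import json
import urllib.parse
import urllib.request

# Placeholder base URL (assumption): the real Prometheus endpoint is not in the log.
PROMETHEUS_URL = "http://prometheus.example.internal"


def build_query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def extract_values(payload):
    """Return (labels, value) pairs from an instant-query vector response."""
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % payload.get("error"))
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query(base_url, expr, timeout=10):
    """Run an instant query and return the parsed samples."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        return extract_values(json.load(resp))
```

With a reachable endpoint, `query(PROMETHEUS_URL, "sum(ores_uwsgi_busy_workers_count)")` would return the aggregated sample as a list of `(labels, value)` pairs.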
[17:16:32] chrisalbon: ah snap sorry prometheus I think it wasn't ported yet - https://doc.wikimedia.org/wmflib/v0.0.4/
[17:16:57] ah oh well, thanks anyway
[17:17:08] (we are factoring out from the spicerack lib that we use for cookbooks etc.. lemme see if I can find examples)
[17:18:01] chrisalbon: https://github.com/wikimedia/operations-software-spicerack/blob/master/spicerack/prometheus.py
[17:18:36] it is very simple, it uses requests
[17:18:51] sweeeet
[17:18:54] I like simple
[17:19:05] (we are factoring out stuff from spicerack, that is used mostly for our automation cookbooks etc.. we == SRE)
[17:38:05] https://www.irccloud.com/pastebin/nQkXRKp9/
[17:38:09] :(
[17:40:12] chrisalbon: from where are you curling? because that works inside the wmf network
[17:40:45] oh, good point
[17:41:41] In my head ores grafana=public meant ores prometheus=public, but of course there is no reason that is the case
[17:44:28] yeah chrisalbon the prometheus instances themselves are on the private network, for a few reasons :)
[17:44:43] but it's quite simple to do some curls internally, or forward the ports over ssh
[17:44:57] yeah, working on that now. Awesome thanks
[17:45:16] happy to help if you get stuck or have questions :)
[17:46:21] Great! This is a good learning opportunity for me so I might come back with questions
[18:15:37] (PS5) Joal: Add caching to maxmind readers in core package [analytics/refinery/source] - https://gerrit.wikimedia.org/r/638021 (https://phabricator.wikimedia.org/T267009)
[18:16:36] (CR) Mforns: [C: +1] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/638021 (https://phabricator.wikimedia.org/T267009) (owner: Joal)
[18:17:02] (CR) Joal: Add caching to maxmind readers in core package (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/638021 (https://phabricator.wikimedia.org/T267009) (owner: Joal)
[18:17:06] Thanks mforns :)
[18:17:15] :]
[18:17:31] mforns: will investigate hive CSE later after dinner :)
[18:17:46] ok! lmk if you find sth interesting!
[18:36:34] * elukey afk!
[19:45:29] Analytics-Radar, Operations, ops-eqiad: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (Cmjohnson) Open→Resolved @elukey the an-presto1004 motherboard has been replaced and the backplane, everything came back up as normal except I am not able to ssh into the server and fresh i...
[21:30:11] GoranSM: o/
[21:31:32] GoranSM: a question came up today as part of the Research's office hour which you may be able to help with. It's about how to only count edits in Wikidata done directly in Wikidata (as opposed to through other projects). I signed up to follow up on the question. Do you have an answer to this?
[22:21:31] GoranSM: I have another question for you for when you will be around. There was a question during our office hours about Wikidata editors and what we know about them. If I understood correctly, it's about the aggregate profile of users: how many are volunteers, how many are from different organizations/institutions, ... and how do these groups collaborate with each other. Any pointer
[22:21:34] for such research? For more context, check Topic 5 in https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[23:21:43] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:32:21] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers