[00:04:17] Analytics, Beta-Cluster-Infrastructure, Event-Platform, MW-1.36-notes (1.36.0-wmf.20; 2020-12-01), and 2 others: Server returned error: HTTP 500 appears while trying to open VE or reply to a comment on Beta cluster - https://phabricator.wikimedia.org/T268184 (Jdforrester-WMF) Open→Resolved...
[02:06:44] PROBLEM - Check the last execution of refinery-import-siteinfo-dumps on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refinery-import-siteinfo-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:09:49] mforns: moved docs for traffic entropy to a more generic name, will also rework those a bit as part of work in writing blogpost, ciaooo
[06:27:51] nuria: thanks!
[06:50:03] !log restart refinery-import-siteinfo-dumps.service on an-launcher1002
[06:50:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:53:12] RECOVERY - Check the last execution of refinery-import-siteinfo-dumps on an-launcher1002 is OK: OK: Status of the systemd unit refinery-import-siteinfo-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:07:23] !log roll restart java daemons on Hadoop test for openjdk upgrades
[07:07:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:25:41] Good morning
[07:28:11] Thanks a lot elukey for the fix of dumps imports
[07:35:41] joal: today it starts my ops week! :)
[07:35:46] bonjour :)
[07:36:33] elukey: I feel embarrassed you have an ops week :S To me you do it every week!
[07:37:26] joal: ahahah please don't, it is good to help you folks not being on-call every couple of weeks
[07:37:37] plus it is a good refresh of things that I don't do often
[07:37:47] (like deploying)
[07:38:39] elukey: As you always help others during their ops weeks, please ask for any help you'd need :)
[07:40:05] of course!
[07:41:09] elukey: I need to make myself excused for my recent ball-dropping ;)
[07:45:02] ahahhah noooooo
[07:46:37] Dentist appointment, back in a few
[09:03:42] Analytics, Operations, ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T268171 (Peachey88)
[09:08:23] Analytics, Operations, ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T268171 (elukey) This is related to work that Chris did yesterday on the node, I'll wait for the rebuild to finish before closing.
[09:11:21] joal: (when you are back) - https://gerrit.wikimedia.org/r/640448
[09:11:41] Back I am elukey - reading
[09:11:55] basically the dcops team is going to move some hadoop workers between racks (no row or ip change)
[09:12:28] so what I thought was to update the rack config for hdfs ahead of time (we have to restart masters for java upgrades), giving it time to shuffle blocks if needed etc..
[09:12:42] super fine elukey
[09:12:43] and then let dcops move nodes in a few days
[09:12:47] does it sound ok?
[09:12:51] ah okok perfect :)
[09:18:55] Analytics, Data-Services, cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (JAllemandou) Some thoughts about OLAP engine for Cloud. [[ https://prestosql.io/ | P...
[09:29:21] elukey: quick question - shall we start using the new CNAME kerb for hive, or not yet?
[09:29:35] joal: depends for what :)
[09:29:42] elukey: new jobs
[09:30:03] joal: if those are oozie jobs, yes please
[09:30:55] ok will ask that in the CR
[09:30:57] thanks elukey
[09:40:29] elukey, klausman, chrisalbon - An interesting one I think: https://github.com/criteo/mlflow-yarn
[09:41:27] and a second one (adding ebernhardson for the ES bit): https://github.com/criteo/mlflow-elasticsearchstore
[09:41:27] nice
[09:43:55] joal: https://www.oreilly.com/library/view/kubeflow-for-machine/9781492050117/ is also nice afaics
[10:03:45] Analytics-Kanban: Analytics Hardware for Fiscal Year 2020/2021 - https://phabricator.wikimedia.org/T255145 (elukey)
[10:04:17] Analytics-Kanban: Analytics Hardware for Fiscal Year 2020/2021 - https://phabricator.wikimedia.org/T255145 (elukey)
[10:12:25] Analytics: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (elukey)
[10:38:49] Analytics, Analytics-Kanban: Refactor puppet profiles to reduce hiera pollution - https://phabricator.wikimedia.org/T268220 (elukey)
[10:38:57] Analytics: Refactor puppet profiles to reduce hiera pollution - https://phabricator.wikimedia.org/T268220 (elukey)
[10:40:16] self nerd snipe --^
[10:43:03] Glad I am not the only one prone to that
[10:43:31] Also, morning
[10:43:54] good morning :)
[10:52:48] Analytics: Refactor puppet profiles to reduce hiera pollution - https://phabricator.wikimedia.org/T268220 (elukey) https://gerrit.wikimedia.org/r/c/operations/puppet/+/641940
[11:26:48] so I decided that in my ops week I'll focus also on tech debt
[11:26:57] like puppet refactorings etc..
[11:28:29] That is a bottomless pit. Sometimes a bottomless pit of despair. I applaud your sacrifice.
[11:28:38] (and am saddened by its necessity)
[11:29:06] klausman: this is my pit of sadness https://phabricator.wikimedia.org/T240437
[11:29:41] but it helps me to collect things that need to be "fixed"/automated/etc..
[11:30:10] I know how big the pain is, and with Nuria we decided to dedicate time on reducing that mountain every Q
[11:30:24] it has worked so far, but I need more time for big things :D
[11:30:57] There never is enough time
[11:31:53] ah also, the most important thing is that after 5y in here a ton of problems were also added by me :D
[11:32:00] (past Luca, so bad)
[11:32:08] so I am also balancing karma
[11:42:59] * elukey lunch!
[12:37:30] good morning team!
[12:38:22] Hi fdans
[13:37:14] hola
[13:37:16] !
[14:02:38] elukey: milimetric i just sent you an invite for a meeting in 1.5h to talk with some cloud vps folks about resurrecting the public presto project
[14:02:46] if you can make it that'd be swell!
[14:02:48] otherwise don't worry!
[14:03:09] def will be there
[14:04:15] oh joal you wanna come too?
[14:04:43] Sure ottomata - I added a comment about perf in the task
[14:04:50] ya just saw
[14:04:52] k inviting
[14:04:55] Thanks
[14:06:09] ottomata: sure!
[14:06:13] (also gmorning :)
[14:06:31] morning!
[14:06:53] actually ottomata I'll probably be missing because of bad timing with kids - you know my points from the task :)
[14:06:59] ok !
[14:07:21] sorry for the noise ottomata :)
[14:26:57] (CR) Joal: [C: -1] "Some more comments" (9 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/640146 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[14:27:15] fdans: I finally got to review that! Sorry for the waiting time :S
[14:27:38] joal: nono thank you joseph, this is great :)
[14:29:55] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (JAllemandou) Hi @cchen - Let's organize a meeting to try to debug this issue?
[15:25:35] ack nuria, thanks!
[15:25:40] heya teammmm!
[15:27:15] holaaaa
[15:30:10] hey elukey and joal :], I'm seeing the SLA alert for wikidata-item_page_link-weekly-coord, it seems it's stuck since October 19th? It seems the same problem with the datacenter switch.
[15:32:24] ottomata: ping for meeting
[15:32:43] mforns: I wanted to have a chat with you about it yes! (in a meeting now)
[15:32:53] ok ok, np
[15:59:46] o/
[16:02:03] ottomata joal do you have some time today (15-min) for me to run you over the goals for Fabian's first 30 days and get your feedback for the Analytics component of it?
[16:02:25] hello yes sure!
[16:02:33] find a slot in my cal to squeeze it in
[16:03:23] i think 3-4 my time is currently free
[16:03:57] elukey: it's getting better:
[16:03:58] 16:03:38.253| ATS: 26022 Varnish: 24692 Both:28866
[16:04:04] * elukey dances
[16:04:06] \o/
[16:04:33] This is now with the read-pointer reset on start, and query string normalization
[16:04:47] Current state of code is on 1008
[16:05:54] Current state of brain is fried :)
[16:35:44] !log roll restart hadoop workers for openjdk upgrades
[16:35:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:35:53] mforns: ---^ as FYI this might generate alerts
[16:36:13] ok elukey, thanks!
[16:37:08] Any hive metastore updates this week? Our airflow stopped talking to it (fixed now, turns out because we had the ipv4 address hardcoded instead of an-coord1001.eqiad.wmnet, never fixed after it started listening on ipv6). Anyways just writing up how this ended up being triggered in the ticket
[16:38:07] basically the kerberos auth started failing, but swapping in the name instead of the ip seems to have let it run. I don't 100% understand what went wrong either...
[16:39:21] ebernhardson: no hive metastores updates, but I am curious about the IP thing - what was the kerberos error?
[16:39:48] one thing that we are doing is having two hive-servers (not metastores), on an-coord1001 and an-coord1002 (new host for failover)
[16:40:12] elukey: well oddly, the only thing our airflow side started doing was saying 'connection already open', but further looking got:
[16:40:15] thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-1) SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Server hive/10.64.21.104@WIKIMEDIA not found in Kerberos database)'
[16:40:26] instead of targeting hive/an-coord1001.eqiad.wmnet@WIKIMEDIA, in the bright future people will have to use hive/analytics-hive.eqiad.wmnet@WIKIMEDIA
[16:40:41] ebernhardson: ahhhh yes now I know what happened, yes there was a change
[16:41:17] we disabled a feature in krb clients, that is basically DNS normalization
[16:41:42] hive/10.64.21.104@WIKIMEDIA is not a principal, I think that the IP was reverse-resolved and fixed
[16:41:43] ahh, ok that would make sense
[16:42:01] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:42:04] yes yes so my bad, I didn't see any impact but there was a little one sorry
[16:42:21] probably i should have followed up and put the name back in there as soon as it was listening to ipv6 (didn't take more than a few days)
[16:42:24] i was just lazy...
[16:43:00] ebernhardson: basically we disabled the "resolve the hostname via DNS and then reverse lookup etc.." since we don't want the clients to resolve analytics-hive.eqiad.wmnet
[16:43:05] it is a CNAME
[16:43:26] the hive server daemon on an-coord1002 is configured with hive/analytics-hive.eqiad.wmnet@WIKIMEDIA as principal
[16:43:44] that makes lots of sense, i have wondered a bit about the HA capabilities of some of this :)
[16:43:44] so it doesn't comply if a client gets a token for hive/an-coord1001.eqiad.wmnet@WIKIMEDIA
[16:43:54] yes we are slowly doing it :)
[16:44:02] the metastores are weird, I am still looking into that
[16:44:18] apparently the suggested way is to have HA listing all metastore hostnames in hive-site.xml
[16:44:58] (and then use a db-based session handling, as opposed to in ram)
[16:45:07] yea, thats what i was seeing in the bigtop puppet stuff
[16:45:16] so more to come, I'll alert you when doing things, sorry!
[16:45:40] no worries, i do appreciate a heads up though :) I probably wouldn't have realized this would break, but might have known where to look
[16:47:28] yep definitely
[16:47:54] java daemons of course don't look up the settings that I have changed, but whatever uses C-based libraries (even wrapped) it does
[16:48:09] so I guess that airflow uses python-sasl or something similar
[16:48:52] yea, it uses a sasl wrapper from cloudera around c libs
[16:50:58] I hate that thing
[16:51:04] yes yes then it makes sense
[16:52:02] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:02:07] klausman: standup??
[17:02:38] Hi leila - I'll have time after our team meetings :)
[17:03:53] Analytics: Check data currently stored on thorium and drop what it is not needed anymore - https://phabricator.wikimedia.org/T265971 (mforns) **tl;dr** Did some vetting of the data and seems fine. Some minor differences but probably due to the way I measured them. --- I ran `tree` on both `thorium:/srv/back...
[17:36:23] Quarry, cloud-services-team (Kanban): Do some checks of how many Quarry queries will break in a multiinstance environment - https://phabricator.wikimedia.org/T267989 (Bstorm) Adding @dcaro in case he has time or interest to help dig in that database. It's in the quarry Cloud VPS project. Local root can a...
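The Kerberos failure discussed between 16:37 and 16:51 can be illustrated with a minimal sketch. It assumes the client builds the service principal directly from whatever host string it was given once DNS canonicalization is off; the log only says that "DNS normalization" was disabled and does not name the exact settings (in MIT Kerberos the usual candidates are `dns_canonicalize_hostname` and `rdns` in `krb5.conf`, which is an assumption here). The function names are hypothetical; the host names, IP and realm are the ones quoted in the log.

```python
import ipaddress


def is_ip_literal(host: str) -> bool:
    """Return True when `host` is an IPv4/IPv6 literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def service_principal(service: str, host: str, realm: str) -> str:
    """Build the principal requested when the host string is used verbatim.

    Assumption: with canonicalization disabled, no forward or reverse DNS
    lookup rewrites `host` before the ticket request is made.
    """
    return f"{service}/{host}@{realm}"


# Hard-coded IP, as in the old airflow config: no such principal exists.
by_ip = service_principal("hive", "10.64.21.104", "WIKIMEDIA")
# Host name that currently works, and the CNAME planned for the future.
by_name = service_principal("hive", "an-coord1001.eqiad.wmnet", "WIKIMEDIA")
by_cname = service_principal("hive", "analytics-hive.eqiad.wmnet", "WIKIMEDIA")

for principal in (by_ip, by_name, by_cname):
    print(principal)
```

Checking a configured host with `is_ip_literal` before connecting is one cheap way to catch a hard-coded address that would produce a principal like the first one.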
[17:42:32] ottomata: groskin?
[17:43:25] joal: I'm looking at the common times we have and I think async it is for today. I'll send you and Andrew an email and maybe we can connect briefly tomorrow or we just go async. sorry.
[17:44:10] sah oops comin!
[17:44:22] ack leila - sync tomorrow is good :)
[17:44:49] <3
[17:51:15] Analytics, Analytics-Kanban: EventStreams UI - https://phabricator.wikimedia.org/T268255 (mforns)
[17:51:25] Analytics, Analytics-Kanban: EventStreams UI - https://phabricator.wikimedia.org/T268255 (mforns)
[17:51:30] Analytics-EventLogging, Analytics-Kanban, Event-Platform, Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (mforns)
[17:51:55] Analytics, Event-Platform, Product-Infrastructure-Data: Automate EventGate validation error reporting - https://phabricator.wikimedia.org/T268027 (fdans) p:Triage→Medium
[17:52:41] Analytics-Radar, Operations, ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T268171 (fdans)
[17:53:37] Analytics-Radar, Operations, ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T268171 (Cmjohnson) Open→Resolved a:Cmjohnson It looks like all the disks are working from my end. I am resolving this task.
[17:54:35] Analytics: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (fdans) p:Triage→Medium a:razzi
[17:56:14] Analytics, Analytics-Kanban, Patch-For-Review: Refactor puppet profiles to reduce hiera pollution - https://phabricator.wikimedia.org/T268220 (fdans) p:Triage→High a:elukey
[18:01:53] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: eventgate-analytics-external occasionally seems to fail lookups of dynamic stream config from MW EventStreamConfig API - https://phabricator.wikimedia.org/T266573 (Ottomata) Parking some new log messages: ` [2020-11-19T16:28:28.490Z]...
[18:04:43] Analytics, Analytics-Kanban, Event-Platform, Patch-For-Review: eventgate-analytics-external occasionally seems to fail lookups of dynamic stream config from MW EventStreamConfig API - https://phabricator.wikimedia.org/T266573 (Ottomata) For reference, eventgate-analytics-external-production-74bc7...
[18:45:58] mforns: all good! (netflow)
[18:46:25] elukey: thankkks!
[18:47:23] * elukey afk! o/
[18:51:18] (PS1) Fdans: Pageview complete - Print explicit null values when there's no page id [analytics/refinery] - https://gerrit.wikimedia.org/r/642079 (https://phabricator.wikimedia.org/T267575)
[19:01:28] (CR) Joal: [C: +1] "LGTM! Thanks fdans" [analytics/refinery] - https://gerrit.wikimedia.org/r/642079 (https://phabricator.wikimedia.org/T267575) (owner: Fdans)
[19:38:03] (PS4) Cwhite: update openapi definitions to version 3 and dependency upgrades [analytics/aqs] - https://gerrit.wikimedia.org/r/558685 (https://phabricator.wikimedia.org/T240995)
[19:38:30] (CR) jerkins-bot: [V: -1] update openapi definitions to version 3 and dependency upgrades [analytics/aqs] - https://gerrit.wikimedia.org/r/558685 (https://phabricator.wikimedia.org/T240995) (owner: Cwhite)
[19:45:54] (PS1) Joal: [ONE-OFF] Add job to fix pageview-complete [analytics/refinery/source] - https://gerrit.wikimedia.org/r/642109
[19:45:59] milimetric: --^
[19:46:31] thanks joal, I'll take a look as I get out of meetings, though I may also just throw a blanket over my head and cry
[19:46:38] I don't think I've talked this much since the day before my wedding
[19:46:52] (PS5) Cwhite: update openapi definitions to version 3 and dependency upgrades [analytics/aqs] - https://gerrit.wikimedia.org/r/558685 (https://phabricator.wikimedia.org/T240995)
[19:46:59] Take it easy milimetric :)
[20:44:49] Analytics-Radar, Product-Analytics, Structured Data Engineering, Patch-For-Review, and 2 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (sdkim) a:egardner→jlinehan PI is taking on designing and providing this sch...
[20:45:06] Analytics-Radar, Product-Analytics, Product-Infrastructure-Data, Structured Data Engineering, and 3 others: Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (sdkim)
[22:41:27] Analytics, Gerrit, Release-Engineering-Team: Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (brennen)
[22:41:52] Analytics, Gerrit, Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (brennen)
[23:19:08] anyone: how can I get read access for sqoop? I'm attempting to run the following query: `sqoop list-tables --connect jdbc:mysql://labsdb1012.eqiad.wmnet/enwiki_p --driver org.mariadb.jdbc.Driver` and receive the error: `java.sql.SQLInvalidAuthorizationSpecException: Access denied for user 'lexnasser'...`
[23:19:37] I know I'm supposed to query with a `--username` and `--password-file`, but I'm not sure how to obtain those
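For the sqoop question above, a minimal sketch of what the full invocation could look like once credentials exist. The log does not say how to obtain them, so the username and password-file path below are hypothetical placeholders; only the JDBC URL and driver come from the message, and `--username` / `--password-file` are the standard Sqoop options it mentions. The helper name is illustrative, and the sketch only builds and prints the command rather than running it.

```python
import shlex


def sqoop_list_tables_argv(jdbc_url, driver, username, password_file):
    """Assemble the argument vector for `sqoop list-tables` with explicit credentials."""
    return [
        "sqoop", "list-tables",
        "--connect", jdbc_url,
        "--driver", driver,
        "--username", username,
        "--password-file", password_file,
    ]


argv = sqoop_list_tables_argv(
    jdbc_url="jdbc:mysql://labsdb1012.eqiad.wmnet/enwiki_p",  # from the log
    driver="org.mariadb.jdbc.Driver",                         # from the log
    username="EXAMPLE_USER",                                  # hypothetical placeholder
    password_file="/path/to/password-file",                   # hypothetical placeholder
)

# Print a shell-safe rendering; pass `argv` to subprocess.run to execute it.
print(shlex.join(argv))
```

Keeping the password in a file rather than on the command line avoids exposing it in the process list and shell history.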