[00:47:52] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:14:06] RECOVERY - HDFS corrupt blocks on an-master1001 is OK: (C)5 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[01:54:52] RECOVERY - Check the last execution of monitor_refine_event_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:00:53] 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: Test if Hue can run with Python3 - https://phabricator.wikimedia.org/T233073 (10elukey)
[06:00:55] 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: Test if Hue can run with Python3 - https://phabricator.wikimedia.org/T233073 (10elukey)
[06:01:16] 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: Test if Hue can run with Python3 - https://phabricator.wikimedia.org/T233073 (10elukey)
[06:01:34] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Test if Hue can run with Python3 - https://phabricator.wikimedia.org/T233073 (10elukey)
[06:30:07] good morning
[06:30:16] I am going to reboot stat1004 as announced
[06:31:56] good morning elukey
[06:33:53] bonjour!
[06:54:09] joal: when you have a moment I'd like your opinion on https://gerrit.wikimedia.org/r/c/operations/puppet/+/629647/ (I can explain what the thing does)
[06:54:22] reading elukey
[06:57:43] elukey: here's my understanding - this patch uses facter (no clue about what it does in detail, but it seems able to give you machine partitions) to set up partitions for the test-cluster hadoop workers, instead of having them manually defined
[06:58:04] for all the workers, not only the testing ones
[06:58:17] the hiera removed is what I have to do every time disks fail etc..
[06:58:46] elukey: hiera-removed?
[06:58:52] so the new code picks up the partitions of 12-disk and 22-disk workers seamlessly
[06:59:01] elukey: the fact that there are special files for certain hosts?
[06:59:02] yes in the patch there is a lot of hiera code removed
[06:59:10] yep
[06:59:29] ok - so those hosts needed special files because their case was different from the default, because of failed disks
[06:59:32] right?
[06:59:55] yarn NM is more stupid than hdfs DN, and when a disk fails and the partition is unmounted it still tries to write to the same directories.
[07:00:12] ok
[07:00:17] so it needs to be told to avoid doing so, and per-host explicit config is what I have been doing
[07:00:27] ok
[07:00:34] but with the new 22-disk workers I'd have needed to add more config, etc..
[07:00:43] And now that is done automagically with this patch
[07:01:12] yep it does, and there is a fail-safe if for some reason a puppet run leads to fewer partitions than expected (puppet fails, we don't change the config and we get an alert)
[07:01:16] does it sound ok?
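The facter-based approach discussed above amounts to deriving the worker's data-directory list from what is actually mounted, instead of maintaining per-host hiera. A rough, self-contained sketch of the idea; the `/var/lib/hadoop/data` prefix and the sample mount table are assumptions for illustration, not the actual puppet code:

```shell
# Sketch: build the datanode/NM directory list from mounted partitions.
# The mount table below is a hardcoded sample standing in for /proc/mounts.
mounts='/dev/sdb1 /var/lib/hadoop/data/b ext4
/dev/sdc1 /var/lib/hadoop/data/c ext4
/dev/sda1 / ext4'

# Keep only hadoop data mount points, sort for a stable order, join with commas.
datadirs=$(printf '%s\n' "$mounts" \
  | awk '$2 ~ "^/var/lib/hadoop/data/" {print $2}' \
  | sort | paste -sd, -)
echo "$datadirs"
```

A failed disk whose partition is unmounted simply drops out of the list on the next run, which is exactly the per-host hiera edit this replaces.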
[07:01:39] it is the missing bit before adding the new workers basically
[07:01:39] well I guess for me it wouldn't make any diff :)
[07:01:58] elukey: joking - less manual settings --> sounds great :)
[07:02:05] super thanks :)
[07:02:20] we'll see how it goes, the fail-safe makes me feel more comfortable
[07:02:41] the funny part is that for some hadoop workers in test
[07:02:42] like https://puppet-compiler.wmflabs.org/compiler1002/25428/analytics1034.eqiad.wmnet/index.html
[07:02:52] the new config has found some inconsistencies
[07:03:00] like partitions not mounted correctly, etc..
[07:03:11] so it is doing a better job than mine :D
[07:03:35] all right, merging very carefully, and then I'll complete the cleanup for the TLS certs
[07:05:31] \o/
[07:13:42] already found an inconsistency on analytics1044, a prod worker
[07:15:51] ah wow a partition was not used
[07:18:41] !log restart datanode on analytics1044 after new datanode partition settings (one partition was missing, caught by https://gerrit.wikimedia.org/r/c/operations/puppet/+/629647)
[07:18:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:18:45] let's see
[07:20:09] the datanode complained about some blocks, I think that the namenode told it that they were stale
[07:21:25] ok I'll leave this for a bit before proceeding, I want to make sure that all is good
[07:30:48] 10Analytics, 10Event-Platform, 10Technical-blog-posts: Story idea for Blog: Wikimedia's Event Platform - https://phabricator.wikimedia.org/T253649 (10Nintendofan885) @srodlund shouldn't the footers of part 1 and 2 be updated to link to part 3
[08:03:23] also found https://gerrit.wikimedia.org/r/630041
[08:03:36] that we do for hdfs datanode partitions (so in hdfs-site.xml) but not for yarn
[08:03:40] so puppet may change the order
[08:04:07] (not a big deal if the list is consistent but better to have a canonical version)
[08:07:00] ack elukey - makes sense :)
[08:18:27] ok finally done
[08:18:31] now cleanup of TLS stuff
[08:37:15] Morning! Backup of 1006 cleaned out and backup of 1007 started
[08:42:07] morning! ack
[08:45:48] This machine is a lot less busy, and the backup is therefore close to 2x faster
[08:45:58] I'm getting >100MiB/s off-disk
[08:46:17] The machine also has a mere 100 days of uptime ;)
[08:46:52] the clustering around 1007 was a lot worse in the past, we had a lot of different configs for stat100x and 1007 was the preferred one
[08:47:12] always raising alerts for ram usage, overloaded all the time..
[08:47:52] So now everyone moved away and it's idle?
[08:48:09] ahahah yes it is funny but I believe this happened
[08:48:53] works for me :-P
[08:49:41] I suspect 7 is also a beefier machine than 6. At least it feels that way
[08:52:11] also more recent
[08:54:02] Yea, rough back-of-the-envelope guesstimation says it's about 95MiB/s so far
[08:55:31] klausman: another thing that we could do, to complete the picture, is replace an-tool1006 (basically a ganeti vm to mimic stat100x on the test cluster) with another vm on buster
[08:55:41] to see the whole procedure of creating vms etc..
[08:56:33] Sounds like an idea. I haven't played with Ganeti, so far.
[08:57:32] !log restart daemons on analytics1052 (journalnode) to verify new TLS setting simplification (no truststore config in ssl-server.xml, not needed)
[08:57:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:59:20] * elukey coffee
[09:17:39] sukhe: hey, I noticed you have a ton of python processes on stat1007 that were spawned by cron, and they seem to never terminate.
[09:17:54] /home/sukhe/project_monitoring/scripts/check_projects.py
[09:18:46] I count 50 cron instances, each spawning 1+4 (leader, children) python processes.
[09:19:24] It doesn't look like they're burning CPU, but I suspect they are meant to terminate.
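The fail-safe elukey describes (puppet fails, the config is left alone, and an alert fires, rather than silently shrinking the directory list) reduces to a count comparison. A minimal sketch; the expected/found numbers are illustrative, not taken from a real host:

```shell
# Fail-safe sketch: refuse to rewrite the config when fewer data partitions
# are found than expected, so the error surfaces instead of masking a problem.
check_partitions() {
  expected=$1
  found=$2
  if [ "$found" -lt "$expected" ]; then
    echo "ERROR: only $found of $expected data partitions mounted"
    return 1
  fi
  echo "OK: $found partitions"
}

# A 12-disk worker with one unmounted partition trips the check.
check_partitions 12 11 || result=failed
echo "$result"
```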
[09:25:55] PROBLEM - Webrequests Varnishkafka log producer on cp5001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:26:55] RECOVERY - Webrequests Varnishkafka log producer on cp5001 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:27:48] mmmm
[09:28:12] varnish 6 rollout, all good
[09:50:36] klausman: https://gerrit.wikimedia.org/r/c/operations/puppet/+/630108 seems a good compromise for the moment, what do you think?
[09:50:43] (we forgot to ask Andrew)
[09:56:56] sounds good
[09:57:33] It might break if 2.2.4 goes away, but we can burn that bridge when we get to it
[09:59:30] yeah but in theory it is a default value, and as soon as the real package is installed the config gets fixed
[10:00:42] I was just erring on the side of me not understanding Puppet entirely ;)
[10:01:38] I still don't understand basic behaviors like require, contain, include etc..
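The varnishkafka PROBLEM/RECOVERY pair above is a plain process-count check: CRITICAL on 0 matching processes, OK on exactly 1. A simulated version against a made-up process list (the real Icinga check runs check_procs against the live process table):

```shell
# Simulated PROCS check for varnishkafka. ps_output is a fake process list.
ps_output='/usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf
/usr/sbin/sshd -D'

# Count processes started with the webrequest config.
count=$(printf '%s\n' "$ps_output" \
  | grep -c -- '-S /etc/varnishkafka/webrequest.conf')
if [ "$count" -eq 1 ]; then status=OK; else status=CRITICAL; fi
echo "$status: $count process(es)"
```

During the varnish 6 rollout the process briefly disappeared (count 0, hence CRITICAL) and came back a minute later (count 1, RECOVERY).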
[10:01:54] I mean, sometimes I hope I do understand those but puppet behaves differently
[10:01:59] Namespacing in Puppet is very confusing
[10:02:15] I find it very confusing as a generic tool :D
[10:02:29] There are entire *layers* of confusion with Puppet :)
[10:11:20] ok so I have a patch up for the first gpu worker, but I need to follow up with John about a strategy to avoid repetition of hiera configs
[10:11:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/630099/1
[10:11:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/630099
[10:11:51] after that we should be able to have the first gpu worker up
[10:16:26] klausman: I have to create the kerberos keytabs for all the new hosts, we can do it together after lunch if you have time
[10:17:08] Absolutely
[10:27:33] some docs are
[10:27:34] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide
[10:27:37] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos
[10:28:46] Will give them a read over lunch
[10:31:42] please don't, it is not really material that facilitates digestion :D
[10:33:07] noted :)
[10:37:18] * elukey lunch!
[11:44:55] Moooorning
[11:50:51] hi fdans :)
[11:58:13] klausman: hi. yes, this is the old version of the traffic anomaly report that is no longer used. I should probably remove the cron as well
[11:58:32] thanks for the ping -- I will do that today
[11:59:58] thanks!
[13:05:36] elukey: when do you want to do the Kerberos stuff?
[13:08:26] klausman: I was finishing my coffee, anytime :)
[13:08:40] Lemme get some more tea and then we'll start?
[13:08:55] sure! In here, meet, etc.. ?
[13:08:58] no preference
[13:11:53] Let's use meet so you can screenshare
[13:16:32] 1.1T of stat1007 backed up. Took ~6.75h
[13:28:34] morning!
[13:28:40] fdans: I'm looking now, aimin' to merge
[13:29:00] milimetric: sounds good!
[13:29:25] did you get a chance to look at the differences, anything you think we should include for this deploy?
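klausman's backup numbers are self-consistent: 1.1 TiB over ~6.75 h averages out near 47 MiB/s, which fits a run that started above 100 MiB/s and later slowed down. The arithmetic, assuming "1.1T" means TiB:

```shell
# Average throughput for the stat1007 backup: 1.1 TiB over 6.75 hours.
# 1 TiB = 1048576 MiB; 6.75 h = 24300 s.
avg_mib_s=$(awk 'BEGIN { printf "%.0f", 1.1 * 1024 * 1024 / (6.75 * 3600) }')
echo "${avg_mib_s} MiB/s"
```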
[13:34:21] fdans: ^
[13:36:47] (03PS9) 10Milimetric: Allow more than one dimension to be filtered in Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/612574 (https://phabricator.wikimedia.org/T255757) (owner: 10Fdans)
[13:37:09] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Allow more than one dimension to be filtered in Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/612574 (https://phabricator.wikimedia.org/T255757) (owner: 10Fdans)
[13:43:11] (03PS10) 10Milimetric: Add filter/split component to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/613114 (https://phabricator.wikimedia.org/T249758) (owner: 10Fdans)
[13:45:07] 10Analytics, 10Platform Engineering, 10Code-Health, 10Epic, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10WDoranWMF)
[13:46:52] milimetric: a lot of them are pretty straightforward, but I think I would put them all in a change whose mandate is to get rid of the concept of breakdown
[14:36:29] joal: if you're available I have a noob question about editors by country
[14:38:15] fdans: maybe I can answer?
[14:38:49] nuria: bc pre one on one?
[14:39:11] fdans: we can bc
[14:41:12] ready to bootstrap the first gpu node https://gerrit.wikimedia.org/r/c/operations/puppet/+/630185 (an-worker1096)
[14:41:19] so it should catch up during the weekend
[14:54:48] And stat1007 backup has ground down to 10MiB/s :-/
[14:55:05] I'll poke XioNoX about it
[14:58:34] klausman: I forgot to show you https://librenms.wikimedia.org/device/device=161/tab=port/port=14578/
[14:58:40] that could be interesting
[14:58:58] ah, thanks
[14:59:47] 10Analytics-Clusters: Put 24 Hadoop worker nodes in service (cluster expansion) - https://phabricator.wikimedia.org/T255146 (10Jclark-ctr)
[15:04:25] razzi, let me know if you want to meet today for some pair work? I have a meeting in 25 mins, but in 1 hour I'll be free for the rest of the day :]
[15:06:35] 10Analytics, 10Platform Team Sprints Board (Sprint 4), 10Platform Team Workboards (Green): Ingest api-gateway.request events to turnillo - https://phabricator.wikimedia.org/T261002 (10WDoranWMF)
[15:07:29] 10Analytics, 10MediaWiki-REST-API, 10Patch-For-Review, 10Platform Team Sprints Board (Sprint 4), and 2 others: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (10WDoranWMF)
[15:14:18] fdans: found two things that we may want to fix up before deploy:
[15:14:40] 10Analytics-Radar, 10Privacy Engineering, 10Product-Analytics: Clarify the data retention extension process - https://phabricator.wikimedia.org/T256776 (10JFishback_WMF)
[15:14:41] 1. tooltips aren't available on the filter/split stuff
[15:15:00] 2. some metrics don't work on mobile, I think it's the Wikistats 1 ones
[15:15:05] (as in you can't click on them)
[15:15:25] check out https://wikistats-canary.wmflabs.org/testing-filters for the latest code after the rebase
[15:22:26] klausman: interesting fact - I installed linux 4.19 on an-worker1096 (basically to avoid a reboot after the hadoop daemons are bootstrapped) and rebooted
[15:22:49] but the interface eno1 was renamed to eno1something so no networking
[15:22:59] and I had to modify /etc/network/interfaces by hand
[15:23:15] (writing this in here so more people know)
[15:27:17] That's weird. what was the old kernel version?
[15:28:02] 4.9.228-1
[15:29:19] mforns: Yeah let's pair when you're free
[15:30:06] k!
[15:39:45] 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10elukey) I don't find the docs that were pointing to the fact that oozie checks Hadoop perms, so at this point I cannot really support my argument :( On the hadoop masters...
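The eno1 rename after the kernel upgrade on an-worker1096 is the usual predictable-interface-naming trap: /etc/network/interfaces still names an interface that no longer exists after the reboot. A simulated sanity check; the interface names below are made up for illustration (elukey only said "eno1something"):

```shell
# Simulated check: does the interface named in /etc/network/interfaces
# still exist after a kernel/udev change? Names below are made up.
configured=eno1
present_ifaces='lo eno1np0'

# Pad with spaces so the match is on whole interface names only.
case " $present_ifaces " in
  *" $configured "*) verdict="ok" ;;
  *)                 verdict="missing" ;;
esac
echo "$configured: $verdict"
```

On a real host `present_ifaces` would come from something like `ls /sys/class/net`; running this kind of check before a reboot would have caught the mismatch without a trip to the console.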
[15:39:56] razzi: o/ added some info in --^
[15:40:21] I can't find the info that I thought I'd read about hadoop perms checking :(
[15:41:10] I can also explain the user deployment on hosts
[15:41:56] ty elukey
[15:42:49] !log add an-worker1096 (GPU worker) to the hadoop cluster
[15:42:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:42:51] \o/
[15:44:04] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Nuria) cc @mforns that will be working on dev environment for MEP next q...
[15:52:58] !log restart hdfs namenodes to correct rack settings of the new host
[15:53:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:53:09] this is really annoying, I hoped it was smarter
[15:55:18] mforns: o/
[15:55:47] about monitor_refine_eventlogging_legacy_failure_flags.service - I see it is still in a failed state on an-launcher, we could force a restart to see if it recovers
[16:03:56] razzi: if you have a min we can restart one systemd timer
[16:04:12] elukey: Yeah, cya in the bc
[16:04:46] ack gimme 2 mins to grab water
[16:12:51] hey elukey sorry was in a meeting
[16:25:17] !log systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear alerts
[16:25:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:29:45] 10Analytics-Clusters, 10Analytics-Kanban, 10User-Elukey: Create temporary cluster to hold a copy of data for backup purposes - https://phabricator.wikimedia.org/T263814 (10Nuria)
[16:31:19] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:51:34] 10Analytics: [SPIKE] look at prefect as possible alternative to oozie - https://phabricator.wikimedia.org/T263861 (10Nuria)
[17:01:13] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Define reduce calculations needed to compute active editors per project family - https://phabricator.wikimedia.org/T249751 (10Nuria) a:05JAllemandou→03None
[17:01:47] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Stats for newer projects not available - https://phabricator.wikimedia.org/T258033 (10Nuria) 05Open→03Resolved
[17:02:06] 10Analytics, 10Analytics-Kanban: Sort editors-by-country by descending editor-ceil value in cassandra - https://phabricator.wikimedia.org/T262184 (10Nuria) 05Open→03Resolved
[17:05:14] Wow - cool :) https://azure.microsoft.com/en-us/updates/accelerate-analytics-and-ai-workloads-with-photon-powered-delta-engine-on-azure-databricks/
[17:05:22] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (10Nuria) a:05Nuria→03JAllemandou
[17:05:39] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (10Nuria) Update: we will be getting to this next quarter. Assigning to @JAllemandou as a background task
[17:16:47] joal: o/ we have an-worker1096 working :)
[17:16:51] (gpu not configured yet)
[17:16:58] \o/ awesome elukey :)
[17:18:08] hey a-team: did stat1004 get a new SSH fingerprint with the OS upgrade like stat1006 did? If so, could https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1004.eqiad.wmnet get updated with the new fingerprints?
[17:19:05] nuria: the task we talked about: T260409
[17:19:06] T260409: Establish what data must be backed up before the HDFS upgrade - https://phabricator.wikimedia.org/T260409
[17:23:41] Nettrom: hi! it was already upgraded IIRC
[17:24:02] ah no I don't see it in the history
[17:24:12] fixing
[17:25:17] Nettrom: done
[17:25:36] elukey: awesome, thanks so much! :)
[17:26:06] now that stat1004 and stat1006 also have Debian Buster, I'm considering moving off of stat1008 to shift the load a bit
[17:26:30] nice!
[17:26:35] 1007 will follow on monday
[17:31:20] razzi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/630218 :)
[17:31:27] * elukey afk for today! Have a nice weekend folks
[18:34:42] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10egardner)
[18:54:59] 10Analytics, 10puppet-compiler: Puppet catalog compiler fails to diff when change produces non-ascii accented character - https://phabricator.wikimedia.org/T263876 (10razzi)
[19:27:15] (03CR) 10Jenniferwang: "Hi Expert," [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) (owner: 10Jenniferwang)
[19:29:48] (03CR) 10Jenniferwang: "Hello, Could you help to review this checkin?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628237 (https://phabricator.wikimedia.org/T262496) (owner: 10Jenniferwang)
[19:54:34] 10Analytics, 10Platform Engineering, 10Code-Health-Objective, 10Epic, and 2 others: AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10WDoranWMF)
[19:54:36] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10Ramsey-WMF) Don't forget tracking our up...
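The `systemctl reset-failed` step logged at 16:25 is the standard way to clear a failed oneshot unit so the Icinga systemd-state check recovers (as it did at 16:31). Sketched here as a dry run that only prints the commands instead of invoking systemctl, so it is safe to run anywhere; the unit name is taken from the log above:

```shell
# Dry-run sketch of clearing a failed systemd unit to make the alert recover.
unit=monitor_refine_eventlogging_legacy_failure_flags.service

dry_run=$(for cmd in "systemctl status $unit" "systemctl reset-failed $unit"; do
  echo "would run: $cmd"
done)
echo "$dry_run"
```

Note that reset-failed only clears the recorded state; if the underlying refine failure recurs, the timer's next run will flip the unit back to failed.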
[20:17:09] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10egardner)
[20:26:38] 10Analytics-Clusters: Create a temporary hadoop backup cluster - https://phabricator.wikimedia.org/T260411 (10Nuria) see {T260409} probably list of dataset to backup should be consolidated to google doc/wiki where we can update it more easily than on a phab ticket
[20:54:17] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10EBernhardson) > I'd like to know if I'm...
[21:21:27] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10Nuria) +1 to @EBernhardson 's comment...
[21:28:00] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10CBogen) we want to measure the following...
[21:37:01] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): Develop a new schema for MediaSearch analytics or adapt an existing one - https://phabricator.wikimedia.org/T263875 (10egardner) Thanks for the feedback, this...