[00:11:00] (03PS1) 10Jenniferwang: add SpecialMuteSubmit schema to EventLogging whitelist https://phabricator.wikimedia.org/T262499 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628238 [00:33:51] (03PS1) 10Jenniferwang: Add SpecialMuteSubmit schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628241 (https://phabricator.wikimedia.org/T262499) [00:44:57] (03Abandoned) 10Jenniferwang: Add SpecialMuteSubmit schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628241 (https://phabricator.wikimedia.org/T262499) (owner: 10Jenniferwang) [00:47:52] (03Abandoned) 10Jenniferwang: add SpecialMuteSubmit schema to EventLogging whitelist https://phabricator.wikimedia.org/T262499 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628238 (owner: 10Jenniferwang) [02:04:59] (03PS2) 10DannyS712: Add SpecialInvestigate schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628237 (https://phabricator.wikimedia.org/T262496) (owner: 10Jenniferwang) [05:31:53] 10Analytics-Clusters, 10Discovery, 10Discovery-Search (Current work): mjolnir-kafka-msearch-daemon dropping produced messages after move to search-loader[12]001 - https://phabricator.wikimedia.org/T260305 (10elukey) Thanks a lot for all the work on this, really appreciated. As far as I can see it seems that... [05:33:50] goood morning [05:34:06] PROCS CRITICAL: 0 processes with command name 'python2.7', args '/usr/lib/hue/build/env/bin/hue runcherrypyserver' [05:34:29] yep you can bet it icinga, on an-tool1009 we have python3.7 only :P [05:34:32] fixing the alert [06:28:07] Good friday morning team [06:28:54] bonjour [06:30:11] How are you elukey? [06:30:46] joal: good! I am doing a test to see if I can re-use the host-level puppet TLS certs for hadoop, so far map-reduce jobs are failing of course :D [06:30:53] all good for you? [06:31:17] yup, will play with WDQS query parser :) [06:32:33] joal: I hope that eventually what I am doing works, if so we'll have to schedule a maintenance window to swap the certs in main hadoop [06:32:51] ack elukey [06:33:01] elukey: Is there a second option? [06:33:06] in case what you tr fails? [06:34:10] yep yep [06:34:20] we currently use self signed CA + certs [06:38:14] And also I'm preping my arguments to try to make Andrew change some of his slides to more consensual perspective - This is my scary mission of the day [06:40:00] mmm I can't see the slides, it wants to send a confirmation code to analytics internal I think [06:40:10] were you able to open them without issues? [06:40:11] https://docs.google.com/presentation/d/1gYVNKHgpRqW2E_ZphQ9TXKKdgeZMHDM378fCe8roYSo/edit?ts=5f63aea4#slide=id.g9889502f1e_1_10 [06:40:21] does that work elukey ? 
--^ [06:40:26] yep thanks [06:41:23] ahahahha "luca made coffee at 2pm" [06:41:33] I generated an event [06:41:52] elukey: You're a producer :) [06:42:16] PROBLEM - Hue Gunicorn Python server on an-tool1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3.7, args /usr/lib/hue/build/env/bin/hue rungunicornserver https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue/Administration [06:45:38] ahh hue is always so nice [06:46:44] ah yes this is Luca's fault [06:48:00] joal: TLS for shufflers works [06:48:10] \o/ [06:48:25] * joal sends shuffled claps to elukey :) [06:48:32] IIRC the hdfs journal nodes are also using it, but I haven't restarted them yet [06:48:37] all credits to mr jbond42 [06:48:51] I only used his puppet code [06:48:54] so the idea is the following: [06:49:19] 1) the puppet CA public cert is added, by default, to the openjdk's default truststore [06:50:13] 2) we have some defines in puppet to wrap the host puppet tls cert into a .p12 keystore (with custom pass), and then we use that in ssl-server.xml's config [07:08:44] 2020-09-18 07:06:49,325 WARN org.mortbay.log: failed SslSelectChannelConnectorSecure@0.0.0.0:8481: java.lang.NullPointerException [07:08:47] 2020-09-18 07:06:49,325 WARN org.mortbay.log: failed Server@3c01cfa1: java.lang.NullPointerException [07:08:50] 2020-09-18 07:06:49,326 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNode: Failed to start journalnode. [07:08:53] java.io.IOException: Problem starting http server [07:08:56] perfect [07:08:57] this is the journalnode, doesn't like the new setting [07:39:29] ok so it seems working now, there was a missing bit [07:44:37] now I am trying to understand if a roll restart of the journals is fine [07:45:06] in theory, the namenode needs to get the new settings first, namely trusting the puppet CA [07:45:27] then the journals can be restarted, since they'll provide a new cert (signed by the puppet CA) [07:46:09] so I think that the critical use cases are [07:46:11] 1) shufflers [07:46:14] 2) journal nodes [07:46:51] the TLS certs are used also for UIs, but it should be fine to wait for the next roll restart for openjdk upgrades [07:47:12] elukey: does that mean UIs will be unaccessible for a while? [07:48:07] joal: the only UI that we care is yarn.wikimedia.org, I need to check but for that use case we can roll restart the RMs [07:48:34] elukey: I regularly use spark UI as well, as well MapReduce sometimes, to troubleshoot [07:49:09] elukey: Those are nodemanger related I assume, and handling shufflers probably means you'll restart them anyway, no?n [07:49:14] what is the mapreduce ui? [07:49:21] yep yep [07:49:27] the spark ui should be fine [07:49:43] but if we see that something is broken we'll kick off our dear cookbooks [07:49:46] and roll restart [07:49:51] ack elukey [07:50:00] elukey: mapreduce UI example: https://yarn.wikimedia.org/proxy/application_1599578418104_35445/ [07:50:29] joal: yeah but that is provided by the RM [07:50:43] elukey: RM doing proxy for NM I think [07:50:54] actually RM doing proxy for AM probably [07:51:39] not sure if it proxies, because when we get to NM level details the proxy is sadly broken.. anyway, node managers need to be restarted for the shufflers so I am sure it will be ok [07:51:51] but let's triple check after we apply the new settings [07:51:57] Ack elukey - thanks :) [07:55:00] Morning! [07:55:30] o/ [07:57:35] getting coffee [07:57:58] just finished making tea :) [08:04:32] elukey: So how does the reimage work? Just boot into PXE? 
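What the keystore wrapping described above (06:50) boils down to, as a minimal sketch: bundle the host's puppet-issued cert and key into a password-protected PKCS#12 file that ssl-server.xml can point at. The real work is done by a puppet define; the paths, destination and password below are assumptions for illustration only.

```python
import subprocess

# Assumed locations: where the puppet agent typically keeps the host cert/key.
fqdn = "an-test-worker1001.eqiad.wmnet"        # placeholder host
cert = f"/var/lib/puppet/ssl/certs/{fqdn}.pem"
key = f"/var/lib/puppet/ssl/private_keys/{fqdn}.pem"
keystore = f"/etc/hadoop/conf/ssl/{fqdn}.p12"  # assumed destination

# Wrap cert + key into a .p12 keystore with a custom password, i.e. the
# artifact that ssl-server.xml's keystore location/password settings point at.
subprocess.run(
    ["openssl", "pkcs12", "-export",
     "-in", cert, "-inkey", key,
     "-name", fqdn,
     "-out", keystore,
     "-passout", "pass:example-not-the-real-password"],
    check=True)
```

Because the puppet CA public cert already sits in the default openjdk truststore (step 1 above), Java daemons presented with a keystore built this way chain up to a CA they already trust, which is what removes the need for the separate self-signed CA.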
[08:07:24] klausman: it is automated via a script on cumin1001, getting the details [08:08:09] for example: sudo -i wmf-auto-reimage stat1004.eqiad.wmnet -p TXXXXX [08:08:26] Anything else I need to prepare? [08:09:02] what I usually do is to have a tmux session on cumin with the serial console open (the DRAC one), and then another one to run wmf-auto-reimage [08:09:17] yeah, that was my plan [08:09:20] but the script takes care of removing the host from puppet, etc.. [08:09:49] important thing - with the current config d-i should stop and ask for confirmation before applying the partition settings [08:10:06] since we have stat100[4567]) echo reuse-parts-test.cfg partman/custom/reuse-analytics-stat-4dev.cfg ;; \ [08:10:17] (triple checking out loud, I know you already know all this) [08:10:52] we do have 4 disks etc.. [08:10:57] (brb doorbell) [08:11:15] DRAC says we have eight disks. [08:12:40] on 1004? [08:12:58] we have 8 on 1008 IIRC, but 4 on the rest [08:13:00] I may be on the wrong DRAC :) [08:13:32] (I am also checking lsblk -f on stat1004 and it shows 4 sd devs) [08:14:40] Yeah, I was on the wrong DRAC [08:14:54] ok fiuuuu I thought there was some weirdness not accounted [08:15:35] all right so I think we can start (the script also takes case of downtiming etc..) [08:15:45] k, hitting return in 5s [08:15:56] also let's log in #operations [08:17:28] RuntimeError: Must be run in non-interactive mode or inside a screen or tmux. [08:17:34] It *is* in a tmux [08:18:47] interesting [08:19:08] The code it comes from looks okay, and my env has $TMUX nonempty [08:20:33] oooh, does sudo wipe the env? [08:20:49] I usually go with sudo -i [08:21:34] Well, the tmux was started by me, not root, so when you sudo, the EMUX env var disappears [08:21:39] TMUX* [08:21:47] anyway, I fixed it. [08:22:15] sure, I was only saying what I use that works [08:22:17] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts: ` ['stat1004.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-... [08:22:32] ok, rebooting.... [08:22:33] if you fixed it, good :) [08:22:46] what was wrong at the end? [08:23:08] Well, the script insists on $TMUX (or a bunch of other variables) being set [08:23:38] But sudo wipes the environment, so even though I was in tmux, the variable was unset from the script's POV. [08:24:01] So I sudo'd interactively, set the variable manually, then started the script. [08:24:21] it shouldn't be necessary, I have never done it [08:24:45] vOv I dunno. [08:27:15] How do I switch between the tabs in the installer? [08:27:21] so my TERM variable goes from xterm-256color to screen when I am in tmux, and remains even if I do sudo -i bash [08:27:52] But for tmux, it doesn't look at $TERM, but $TMUX. It may be that it's not whitelisted in the sudo config) [08:28:09] could be yes [08:28:10] It seems the partitioner is hanging. 
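The RuntimeError above comes from a guard of roughly this shape: wmf-auto-reimage wants to be run inside screen or tmux (or explicitly non-interactively), and it can only inspect the environment it was started with. Since sudo scrubs the caller's environment by default, $TMUX set in your own shell never reaches the script. A rough re-creation of that kind of check (the real one may differ):

```python
import os

def ensure_detachable_session() -> None:
    """Refuse to run outside screen/tmux, mirroring the error seen above."""
    in_tmux = bool(os.environ.get("TMUX"))   # set by tmux for its children
    in_screen = bool(os.environ.get("STY"))  # set by GNU screen
    if not (in_tmux or in_screen):
        raise RuntimeError(
            "Must be run in non-interactive mode or inside a screen or tmux.")

if __name__ == "__main__":
    ensure_detachable_session()
    print("ok: running inside a detachable session")
```

Starting tmux after `sudo -i` (so the variable is set in root's own environment) avoids the problem without re-exporting anything by hand.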
[08:28:54] ahhh a good moment to blame kormat :D [08:29:19] there is a script called "install_console" on puppetmaster1001 [08:29:21] that should help [08:29:57] sudo install_console stat1004.eqiad.wmnet [08:31:21] https://phabricator.wikimedia.org/P12649 [08:31:47] in syslog there is some info [08:32:47] ahhh I may know the answer [08:32:58] trying one thing [08:34:03] not really, I thought it was the vlan egress firewall preventing us from reaching apt.wikimedia.org for some reason, but it is not [08:34:16] there is a connection timeout that is weird though [08:34:54] klausman: so d-i is hanging basically? [08:34:59] Yep [08:36:43] so maybe reuse-analytics-stat-4dev.cfg is wrong then [08:36:45] It's also not v4 vs v6 being broken (wget -4 does not work, either) [08:37:13] but from the busybox shell right? [08:37:26] It's odd since I suspect during the install it would also fetch packages to install from that host. [08:37:31] Yes, from busybox [08:37:58] yeah it is weird, from stat1005 everything works, also we'd have noticed problems when installing packages etc.. [08:39:50] klausman: so there is a firewall implemented on the routers for the analytics vlan, that was originally intended as protection for production [08:39:55] The v4 IP is the same as during normal operation, the v6 address differs [08:40:08] so traffic from the vlan to production is filtered on junipers [08:40:27] So it can't be the firewall: the host (presumably) had working connectivity to apt, but now it does not, even with v4. [08:41:07] I am wondering if there is something special with d-i [08:41:17] I agree that it worked fine before [08:42:59] you can SSH into the d-i environment from puppetmaster1001: ssh -4 -i /root/.ssh/new_install -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no FQDN [08:43:15] moritzm: I did it with install_console, is it different? [08:43:40] or that, yes [08:43:54] super [08:43:59] I see in syslog something like [08:44:00] Sep 18 08:24:58 log-output: + wget -nv https://apt.wikimedia.org/autoinstall/scripts/reuse-parts.sh -O /lib/partman/display.d/70reuse-parts [08:44:08] and Tobias reported that d-i is hanging now [08:44:09] So wget can do port 80, but not 443 [08:44:17] (on apt) [08:44:24] ah interesting [08:44:40] So it's either a packetfilter somewhere or cert shenanigans [08:44:49] probably d-i waits for interactive input as the reuse partman recipe missed some setting [08:45:12] so the firewall only allows port 80 now that I see [08:45:18] and wget wants https [08:45:23] ah! [08:45:40] So we used http during normal host operation? [08:45:41] but didn't we already reimage a stat host with that recipe? [08:45:50] nope first one [08:45:57] I did kafka jumbo, that is outside the vlan [08:46:01] then I think we found the issue :-) [08:46:15] ok prepping a homer change [08:47:56] klausman: so we have a repo called "homer public", that contains the public configs for most of the junipers config [08:48:10] homer is a cli tool on cumin that allows us to avoid manual commits to the routers [08:48:29] so I am sending a code change for it, and then I'll apply the changes [08:48:57] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/628300 [08:53:51] klausman: https://wikitech.wikimedia.org/wiki/Homer [08:56:28] *nod* [08:56:40] I am committing the changes, in a bit we should be unblocked [08:57:03] I dunno how often wget would retry. should we restart the install or just wait?
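The narrowing-down above (plain HTTP to apt.wikimedia.org works from the busybox shell, HTTPS times out) is what pointed at a per-port filter on the analytics VLAN rather than DNS, routing or certificates. The same comparison as a small probe; d-i itself has no Python, so this only captures the logic:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = "apt.wikimedia.org"
    for port in (80, 443):
        state = "open" if can_connect(host, port) else "blocked or unreachable"
        print(f"{host}:{port} -> {state}")
```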
[08:57:19] I would force a PXE + powercycle [08:57:34] https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_Documentation#Reboot_and_boot_from_network_then_console [08:57:37] from the drac [08:57:55] ok committed, the firewall should be more linient now [08:58:15] yeah, best to restart the reimage [08:58:38] you can simply run: [08:58:52] moritzm: ok as you prefer :) [08:59:05] wmf-auto-reimage --new FQDN, then it will not bail on puppet certs [08:59:23] it should only take an additional few minutes and better ensures a clean state [09:00:03] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts: ` ['stat1004.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-... [09:00:08] Ok, reboot in progress [09:02:13] moritzm: btw, is the installer running in tmux/screen? Or where does the tab-switching functionality at the top come from? [09:03:13] elukey: Ok, in the partitioner. [09:03:27] perfect [09:03:44] which oen do I pick? guided? [09:04:26] no wait it should already be configured with everything, it means that the recipe doesn't work sigh [09:05:51] d-i doesn't use tmux/screen, it's something custom, but would need to dig up what exactly. probably just busybox spawning a few gettys, not sure [09:05:53] It shows the partitions as they are and gives me the usual options (Guided, Confg (swraid, lvm, encryption, iscsi), Undo, Finish [09:06:17] ah okok sorry I misunderstood [09:06:44] so in theory we should check that all the bits that we care are in place, namely /srv with "keep" and / with "format" [09:06:46] Talking with kormat in parallel [09:07:39] klausman: the idea is that /srv needs to be kept intact, so you should see in d-i the same config that we have now, with "keep" [09:08:16] Ah "K" vs "F", how obvious [09:08:18] and eventually just hit "finish" [09:09:18] Ok, K for everything except / [09:09:33] sgtm, continuing [09:09:36] +1 [09:10:00] Ah, *now* it tells me in a useful way [09:10:47] https://phabricator.wikimedia.org/F32354409 [09:11:10] d-i. Such a marvel, such a pita [09:19:55] and it's back [09:20:13] gooood [09:20:27] Will the host key be uploaded automagically? [09:21:05] looks like it. [09:21:40] Oh, it only got removed, not re-added. [09:22:08] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['stat1004.eqiad.wmnet'] ` [09:22:23] Failed!? [09:23:20] yes one of the last steps should be to sign the new key [09:25:17] what errors do you see from the reimage? [09:25:18] 2020-09-18 09:20:58 [INFO] (klausman) wmf-auto-reimage::print_line: Unable to run wmf-auto-reimage-host: Failed to puppet_first_run [09:25:26] ah lovely [09:26:06] ok so from install console, puppet gives me [09:26:07] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed getting spark version via facter. (file: /etc/puppet/modules/profile/manifests/hadoop/spark2.pp, line: 101, column: 9) on node stat1004.eqiad.wmnet [09:26:22] I already seen this and never really had time to fix it [09:26:37] So what is the underlying issue? 
[09:27:15] it is a puppet profile that we should fix, could be a good occasion :) So to unblock things, I am going to just install manually spark2 and then kick off puppet again [09:27:36] klausman: check profile::hadoop::spark2 [09:27:53] it is included in a lot of common things that we deploy on client/servers for hadoop [09:28:00] it contains [09:28:01] # Get spark_verison from facter. Fail if not set. [09:28:01] $spark_version = $::spark_version [09:28:01] if !$spark_version or $spark_version == '' { [09:28:01] fail('Failed getting spark version via facter.') [09:28:03] } [09:28:25] this is fine if spark is already installed, and puppet's facter knows about it [09:28:37] but the first puppet run comes up empty [09:29:03] Ah [09:29:15] So how do we fix stat1004? [09:30:06] so I am in install_console atm, and I just did apt-get install spark2 + puppet agent -tv [09:30:09] that works now [09:30:26] the check above relies on modules/profile/lib/facter/spark_version.rb [09:30:49] I think that the original intent was to avoid manual declarations of what version of spark we use [09:31:07] so I am going to complete the puppet run on install console [09:31:18] Alright. [09:31:38] as follow up, before 6 and 7, we could fix the spark 2 issue [09:32:48] maybe in the .rb file we could add a check for apt-cache show spark2 or similar, if dpkg comes up empty [09:33:01] (or even cache policy) [09:36:10] I presume /srv is intact and well? [09:37:21] haven't checked it, still running puppet [09:40:36] I'd check, but install_console doesn't let me SSH to 1004 [09:41:22] it is still going through package installs, should deploy the ssh keys soon [09:50:00] 10Analytics, 10Patch-For-Review: Fix TLS certificate location and expire for Hadoop/Presto/etc.. and add alarms on TLS cert expiry - https://phabricator.wikimedia.org/T253957 (10elukey) The new settings are working on the Testing cluster as far as I can see, really nice! Procedure wise, this is what I'd do:... [10:00:51] (puppet still running) [10:02:49] /dev/mapper/stat1004--vg-data 7.2T 2.5T 4.4T 36% /srv [10:02:51] :) [10:03:12] (killed/restarted puppet, it gets stuck somewhere) [10:08:25] ack [10:08:34] heading for lunch and a few errands, bbiab [10:25:23] klausman: ok so stat1004 is ready [10:25:43] one caveat - the /home dir should be a symlink of /srv/home [10:28:28] ok what I did is [10:28:33] mv /home /home-backup [10:28:40] ln -s /srv/home /home [10:28:58] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10elukey) The first puppet run on stat1004 highlighted some issues that might need some work before 1006 and 1007's reimages: ` Error: Execution of '/usr/bin/apt-get -q -y -o DPk... [10:29:12] (I see that you are logged in) [10:29:37] 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10jijiki) >>! In T262202#6451901, @Milimetric wrote: >>>! In T262202#6451136, @jijiki wrote: >> @Milimetric my question... [10:31:43] elukey: roger. Should I send the all-clear mail? [10:32:49] klausman: please do a quick check first just to make sure that I am not crazy, but we should be good [10:33:17] The machine looks good to me. [10:33:41] DO we need to point people somewhere for updates to their venvs? Or would they know? 
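On the spark_version failure above: the profile aborts the catalog compile whenever the custom fact (modules/profile/lib/facter/spark_version.rb) comes back empty, which is guaranteed on a first puppet run because spark2 is not installed yet. The fallback floated above, asking apt for the candidate version when dpkg has nothing, could look roughly like the following; it is sketched in Python purely for illustration (the real fact is Ruby), and the apt-cache fallback is a proposal, not current behaviour:

```python
import re
import subprocess

def spark2_version() -> str:
    """Installed spark2 version if present, otherwise the apt candidate, otherwise ''."""
    # Roughly what the existing fact does: ask dpkg for the installed version.
    installed = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Version}", "spark2"],
        capture_output=True, text=True)
    if installed.returncode == 0 and installed.stdout.strip():
        return installed.stdout.strip()
    # Proposed fallback for the very first run: use the apt candidate instead.
    policy = subprocess.run(
        ["apt-cache", "policy", "spark2"], capture_output=True, text=True)
    match = re.search(r"Candidate:\s*(\S+)", policy.stdout)
    if match and match.group(1) != "(none)":
        return match.group(1)
    return ""

if __name__ == "__main__":
    print(spark2_version() or "spark2 not available")
```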
[10:35:24] in theory they should know, in practice we'll follow up with people having troubles (if any) during the next hours/days [10:35:44] Roger. [10:35:44] I created ssh stat1004.eqiad.wmnet -L 8000:stat1004.eqiad.wmnet:8000 and tested kicking off a notebook, looks fine [10:36:16] there is ssh stat1004.eqiad.wmnet -L 8000:stat1004.eqiad.wmnet:8000 [10:36:18] err [10:36:21] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Resetting_user_virtualenvs [10:36:26] that may be useful [10:37:15] ah snap not great, I can only see python3 notebooks available [10:37:56] Are people still using Py2 notebooks? [10:38:10] nono I meant that there should be spark etc.. [10:38:46] !log force ./create_virtualenv.sh in /srv/jupyterhub/deploy to update the jupyter's default venv [10:38:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:39:03] I think that since it was under /srv it got preserved [10:39:33] right [10:40:17] yep fixed :) [10:40:26] Ok, will send mail now [10:41:53] need to go now, bb in ~2h [12:11:30] So I'm a bit puzzled by the spark check. AIUI, it tries to figure out what the version of Spark is, and fails if it can't find any. Wouldn't a better course of action be to install it? [12:41:48] klausman: the idea is (I think) to be able to populate modules/profile/templates/hadoop/spark2-defaults.conf.erb automatically (there is a reference of the spark version in there - but also possibly in other places if needed) [12:42:15] so if we upgrade one host and not the others the new version is picked up fine [12:42:20] (with libs etc..) [12:42:32] not saying it is perfect, it saves some hiera configs [12:42:40] but if you find a better solution please propose :) [12:48:40] \o/ better laptop arrived [12:48:55] As for a better solution, I am not Puppet-competent enough (yet) [12:49:48] sure but a fresh view sometimes leads to simpler/better solutions :) [12:50:13] I need to bring my car to change two tires, I hope to be back in ~30/40 mins (nothing horrible but unexpected) [12:50:52] roger [12:55:16] from what I can tell, one option would be to simply define the Spark version in Hiera [13:28:33] * elukey back [13:31:27] ottomata: o/ morninggg [13:31:38] we have a spark2-puppet qs whenever you have time [13:39:12] heyllooooo [13:39:16] yes elukey how goes ask meee [13:39:29] i saw your stat1004 puppet issues [13:39:37] didn't fully undertsand why puppet can't just install the stuff [13:41:45] I am wondering if declaring the package { 'spark2': } resource explicitly in the class + relying on the puppet ordering sufficie [13:41:49] *suffice [13:42:53] there is no real need for require_package in there no? [13:43:16] I didn't check that part before, I focused on the fail() bit [13:46:09] https://puppet.com/docs/puppet/5.5/lang_containment.html is interesting [13:46:26] "However, unlike resources, Puppet does not automatically contain classes when they are declared inside another class. This is because classes can be declared in several places via include and similar functions. Most of these places shouldn’t contain the class, and trying to contain it everywhere would cause huge problems." [13:47:18] require_package() IIRC creates a new class that adds the package resource, but given what written above it might not necessarily happen as we expect [13:48:50] hmmm, elukey but i think it requires the class, rather than just includes it [13:49:15] which means that the class must evaluate and succeed before the declaring one does [13:49:17] rigiht? 
[13:49:37] hie teammm! [13:49:45] https://puppet.com/docs/puppet/6.17/lang_relationships.html#lang_rel_require [13:49:51] hellooo mforns ! [13:49:54] :] [13:50:42] elukey: whta ffile are you talking about? [13:50:55] oh you just made a patch looking! :) [13:51:07] elukey: the reason for require_package vs just package [13:51:13] is to avoid duplicate declarations [13:51:27] there shouldn't be any no? [13:51:29] if package { 'spark2': } is anywhere else on the same node [13:51:30] it will fail [13:51:38] welll, its kind just best practice to use require_package i think [13:51:50] it has some downsides, though [13:51:52] i don't understand why require_package doesn't work? [13:52:01] not sure, I think it is best to know exactly where the package is deployed [13:52:12] is the problem a confliect with profile::hadoop::common stuff? [13:52:12] it's gets evaluated early. so if a package depends on a compoentn or repo, that will fail [13:52:30] ensure_packages also handles duplicates resources [13:52:35] ah i see [13:52:47] ensure_packages might be good here then moritzm ? [13:53:07] since it won't conflict with the require of another class (which I think does indeed set up some apt source components) [13:53:30] we should deploy spark2 only in that profile though [13:53:36] is it present elsewhere? [13:55:05] it's an option, but Luca's angle seems cleaner in fact [13:55:51] really? huh, i always avoid package resource if I can, but if you say so! [13:55:59] elukey: you can always wrap it in a if !defined before declaring it [13:56:46] ahahha thanks ottomata for the trust :D [13:58:32] (03PS1) 10Ottomata: Reindent refinery-camus scala files with 4 spaces [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/628334 [14:08:14] ottomata: about https://phabricator.wikimedia.org/T253957 - I am inclined to switch to puppet host certificates for hadoop/presto, if you are ok I'll schedule the maintenance to switch next week [14:10:19] elukey: soudns good!@ [14:10:54] (03CR) 10Ottomata: [C: 03+2] Reindent refinery-camus scala files with 4 spaces [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/628334 (owner: 10Ottomata) [14:11:57] whee, new laptop (this time with an almost decent keyboard) [14:12:04] \o/ [14:12:21] so klausman maybe the issue with spark2 is fixed, that unblocks other reimages [14:12:26] we'll see if it re-happens [14:12:28] Nice. [14:13:02] we have two options - we keep using the "reuse-parts-test.cfg" settings that need a confirm before proceeding [14:13:13] klausman: could you update the ssh fingerprint for stat1004 (https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1004.eqiad.wmnet)? i assume it changed in the re-imaging and my ssh isn't happy [14:13:25] or we go straight to reuse-parts.cfg in netboot, so ideally no need for confirmations [14:13:33] I got it via the update-known-hosts script from the wiki [14:13:53] But yes, will update that page [14:14:42] ahh -- not aware of that script. i was just going to manually verify and then remove the old key from my known_hosts [14:14:51] thanks [14:15:32] Also, I can't edit the page, it's protected [14:15:50] BTW, there's an Icinga alert for jumbo1008, known? 
"Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]" [14:16:03] Sounds like a broken PSU [14:16:50] yeah, so for some broken hardware [14:16:57] specifically disks [14:17:13] we have automation which creates a Phab task when the error is noticed [14:17:23] Maybe the machine just doesn't have a secondary PSU? [14:17:51] unlikely, it might also be that during the recent PDU maintenance something was not properly connected [14:18:08] At a previous employer we never had the second PSU slot populated since redundancy was created elsewhere, and the second PSU noticably increased power consumption [14:18:09] in such a case best to open a Phab task and tag is "ops-eqiad" [14:18:49] then some DC ops can investigate the next time they are on site (they won't be on a Friday, otherwise just pinging them on IRC is also an option) [14:19:02] joal: yt? [14:19:11] wanna brain bounce about camus and canary events stuff [14:19:55] klausman: ahh...yes, seems you need admin permissions. no idea how to get those so i'll request ottomata do it when he gets the chance :) https://wikitech.wikimedia.org/w/index.php?title=Special:ListUsers/sysop&limit=2000 [14:20:09] am i an admin? [14:20:10] lets seee [14:20:18] (updating fingerprints for stat1004 at https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1004.eqiad.wmnet) [14:20:18] oh i see i am! [14:20:19] oook [14:20:43] klausman: if you want please open the task to ops-eqiad [14:20:52] will do [14:23:17] rats i can't just add you as an admin [14:24:21] Created T263262 [14:24:21] T263262: Check jumbo1008.eqiad.wmnet PSU setup - https://phabricator.wikimedia.org/T263262 [14:28:13] perfect thanks [14:47:21] are all the refine failed flags for eventlogging_Test ok? [14:47:35] (there is an icinga alert for the related monitor, I'll reset-failed in case) [14:52:59] 10Analytics: Separate RSVD anomaly detection into a systemd timer for better alarming with Icinga - https://phabricator.wikimedia.org/T263030 (10mforns) @ssingh Oh, cool. :] Maybe we can even leave the spark job running as is now (with very few changes on our side), and in the systemd timer job, just check for t... [14:53:47] (I'm around, I'm just quiet because scala is kicking my ass) [14:55:18] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Add editors per country data to AQS API (geoeditors) - https://phabricator.wikimedia.org/T238365 (10mforns) This is solved now, right? [14:56:48] maybe that's the problem, you should be shouting at it :] [14:59:48] mforns: he is quiet on IRC, maybe he is swearing in Romanian and English at the same time at home :D [15:00:05] heheh [15:01:15] I'm trying so hard to contain my anger [15:01:16] so hard [15:01:28] (03PS1) 10Ottomata: eventstreamconfig.py - remove custom logic for computing topic lists [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628349 (https://phabricator.wikimedia.org/T251609) [15:01:35] elukey: recent alarms? [15:01:42] or you mean from a few days ago? 
[15:02:10] whatever it is, yes it is fine [15:02:14] that is just Test canary data [15:04:12] (03PS2) 10Ottomata: eventstreamconfig.py - remove custom logic for computing topic lists [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628349 (https://phabricator.wikimedia.org/T251609) [15:05:09] ottomata: I think last 48 hours [15:05:20] right ok [15:05:48] !log systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear icinga alrms [15:05:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:05:52] elukey: do you know if joseph is off today? [15:10:12] ottomata: I had a chat with him this morning, but not in the afternoon [15:10:25] ok so he will prob be on later coo, looking for my brain bounce buddy :) [15:10:25] ty [15:13:01] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:18:07] (03PS3) 10Ottomata: eventstreamconfig.py - remove custom logic for computing topic lists [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628349 (https://phabricator.wikimedia.org/T251609) [15:23:53] Alright, will do some laptop setup and then head into the weekend 👋 Have a nice Friday and a splendid weekend, everyone! [15:26:57] I think I'm gonna get an ulcer trying to transform this struct [15:28:17] klausman: o/ [15:31:24] milimetric: can I help maybe? [15:31:38] l8rs klausman :) [15:34:25] mforns: you avail for a brain bounce? [15:34:39] I have tried to learn spark/scala by just writing some code, like I do with every other language. And I run into walls and walls and walls of documentation using words I don't understand and showing ten different ways to do simple things, and I just don't have any idea how to learn it. Intuitively it feels like I would need years to stare at it before it makes any sense. And that immediately makes me angry because it's not [15:34:39] the case with other languages, and then I kind of spiral on that for a while and just give up and copy paste some shitty code from somewhere without understanding any more than when I started. So I guess I'm saying I'm failing at this, I don't understand how to succeed, and I don't understand *why* I would want to, since it just seems like it shouldn't be this complicated. [15:34:47] so, in short, no, I think I need therapy [15:35:06] hahaha milimetric ! [15:35:10] scala docs do suck indeed [15:35:17] sparks have been ok for me though [15:35:36] buuut yeah transforming spark structs sometimes can be crazy for sure [15:36:02] i still don't think i've fully understood the spark Column api stuff [15:36:35] strongly typing things to this extent just exponentially increases complexity. And without proper planning of the type system, like C# has, it's miserable [15:37:06] i guess? you say C# is magical bu ti have not used so I have to take your word for it [15:37:17] but, is the problem here just typing? or is it typing in a distributed system? [15:37:30] these distributed system abstractions can make things extra weird [15:37:44] since there's so much serialization and deserialization happening everywhere [15:37:48] so the typing is really important [15:38:15] you so sure that a C# version of spark wouldn't be equally frustrating? 
[15:38:16] :) [15:39:29] I think it's typing, yeah, take this: [15:39:36] struct("a", "b", "c") [15:39:54] org.apache.spark.sql.Column = named_struct(NamePlaceholder(), a, NamePlaceholder(), b, NamePlaceholder(), c) [15:40:02] what is NamePlaceholder()?!!! [15:40:12] I have to look that up, then I'm in that rabbit hole for a while [15:40:18] no idea milimetric but what are you trying to do, struct is a sql api function i think [15:41:08] I really think I should figure it out for myself, I mean we seem like we're not going to change our mind, so maybe I just need to take like a few months and really learn it [15:41:43] otherwise it just feels like everyone's talking a different language, maybe I have trauma about that from when I was a kid [15:42:16] milimetric: is what you want named_struct [15:42:16] ? [15:42:21] instead of struct? [15:42:53] struct takes Columns as args, not strings, [15:42:54] I've literally no idea... and no idea how I would know what I want without reading every single one, trying them out, playing with them, reading lots of unrelated docs [15:43:06] no! you pass it strings! Try it :) [15:43:12] you CAN pass it strings [15:43:29] but I think that makes spark interpret them later as names in some way in a sql context? [15:43:31] column names* [15:43:36] I'm gonna stop now and just read. If I'm not at standup for the next few months, just tell people I'm still reading [15:44:09] haha milimetric it does sound like you just need to do understand some spark sql api stuff [15:44:13] would be good to read stuff [15:44:53] https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Column.html [15:44:59] Column is sort of a SQL abstraction [15:46:51] milimetric: i betcha named_struct is what you want [15:46:52] https://spark.apache.org/docs/2.3.0/api/sql/index.html#named_struct [15:47:52] heh, I clicked on that first link. I now have five tabs open, and I feel like it'll be more like a year than a few months... [15:49:27] 10Analytics-Clusters: Upgrade to Superset 0.37.x - https://phabricator.wikimedia.org/T262162 (10elukey) Created today https://github.com/apache/incubator-superset/issues/10956 with the help of upstream. [15:59:50] Hey ottomata - sorry was gone for kids [16:00:49] ottomata: heyyy, sorry missed your ping, I'm available if you still need [16:03:32] milimetric: If you wish I can give you my understanding, but I'm not sure it;s what you need [16:04:10] I wish that, but do not think it's possible. I think I need to understand it myself. And it seems like I have to read a few books to do that. [16:05:43] milimetric: i spark makes your life hard if you think is a programming language [16:06:11] milimetric: it is really more like super high abstraction over an execution framework [16:07:00] I'm not sure I'm thinking of it as either, I am just trying to understand how to use it. Picture a caveman looking at a screwdriver, that's me right now. [16:07:50] milimetric: you write 100 lines and 20 of them execute one way (master node) and 80 of them execute other way (worker node) and that is not apparent at all by the way it is written , by abstracting that , i think, things are so far away from what is really happening that in teh absence of reading docs you just cannot figure it out [16:08:35] milimetric: scala (alone) , no spark for someone with EXTREME javascript chops is probably a lot more familiar [16:08:47] yeah, I can't figure it out. And I feel like I've suffered enough just pretending like at some point I'll figure it out. 
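On the struct()/named_struct() confusion above: the NamePlaceholder() in the unresolved Column is nothing to chase. struct() accepts bare strings as column names and fills the struct's field names in from those columns once the plan is analysed; aliased columns, or SQL's named_struct, are how you pick the field names yourself. A small PySpark sketch (the Scala API behaves the same way):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("struct-demo").getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

# Bare strings are treated as column names; the field names of the struct
# come from columns a/b/c at analysis time (that is all NamePlaceholder() is).
df.select(F.struct("a", "b", "c").alias("s")).printSchema()

# To choose field names explicitly, alias the columns...
df.select(F.struct(F.col("a").alias("x"), F.col("b").alias("y")).alias("s")).printSchema()

# ...or call the SQL named_struct function directly.
df.select(F.expr("named_struct('x', a, 'y', b)").alias("s")).printSchema()

spark.stop()
```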
I'd like to stop suffering now, since it doesn't look like spark or scala are going away anytime soon, and I don't plan on going away anytime soon [16:09:24] milimetric: in a way, spark is the ultimate spaguetty code [16:09:46] milimetric: I'm around for about an hour, ready to help if you wish - ready not to help if it's better :S [16:10:16] no, seriously, I literally need to read a few books, and it's going to take me a long time. I just don't see any other way [16:10:58] * joal sends book-love to milimetric [16:11:22] ottomata: do we take some time now? canary events + prez? [16:12:53] milimetric: for scala (and functional code) https://www.scala-exercises.org/ might be of help, now, to be fair, for spark it doesn't help at all, i think approaching spark as a language is not helpful, it is an execution framework over gigantic arrays that are distributed in hundreds of machines [16:13:09] milimetric: for me changing that mindset was key [16:13:15] not that i know any spark at all [16:13:30] but realizing that was real helpful [16:14:26] joal: ya! 2 mins bc [16:14:31] sure ottomata [16:51:47] elukey: i shoudl file a ticket for the sudo -u www-data chnages correct? [16:52:25] nuria: I can file a code change on monday, maybe ping me if I forget (taking notes) [16:52:35] elukey: will file task [16:53:19] 10Analytics-Clusters, 10Analytics-Kanban: analytics-admins should be able to sudo -u www-data in analytics systems - https://phabricator.wikimedia.org/T263272 (10Nuria) [16:53:49] 10Analytics-Clusters, 10Analytics-Kanban: analytics-admins should be able to sudo -u www-data in analytics systems - https://phabricator.wikimedia.org/T263272 (10Nuria) a:03elukey [16:54:08] elukey: done, taht we way we do not forget [16:58:17] 10Analytics, 10Analytics-Kanban, 10good first task: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171 (10Nuria) 05Open→03Resolved [16:58:19] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167 (10Nuria) [17:02:04] oops sorry ottomata - too fast to click :) [17:02:15] sok have a good weekend! [17:04:03] Gone for tonight team - have a good weekend :) [17:05:29] me too o/ [17:12:36] byeeeeee [17:51:29] hey a-team: I'm getting "accept: Too many open files" as an error to my SSH terminal on stat1008, and Jupyter fails to save files correctly [17:51:52] it's recurring, but I'm unsure what's triggering it and if it's something I'm doing [17:54:50] Nettrom: hm, does it always happen? [17:54:51] or only sometimes? [17:55:44] ottomata: only sometimes, and it's not clear when it starts to show up [17:55:48] 10Analytics, 10CAS-SSO, 10User-MoritzMuehlenhoff: Allow login to JupyterHub via CAS - https://phabricator.wikimedia.org/T260386 (10MoritzMuehlenhoff) [17:57:35] Nettrom: is it only jupyter? or is it ssh terminal too (you may have just said this) [17:57:39] when it happens, does it happen in both? [17:58:12] ottomata: I think it might be connected to Jupyter in some way, e.g. when it tries to autosave the notebook I'm working in [17:58:31] I get an error about my server not running in Jupyter, and see the "too many files" in my SSH terminal [18:00:55] oh, your server not running? [18:00:56] that is weird [18:01:03] Nettrom: in jupyterhub can you restart your server? 
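A toy version of the point made earlier about spark being an execution framework rather than a language: the top-level lines run in the driver, the function handed to map() is serialized and shipped to the executors, and nothing executes at all until an action forces it, which is why reading the code top to bottom says little about where or when it runs. A minimal PySpark sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("driver-vs-executors").getOrCreate()
sc = spark.sparkContext

# Everything at this level runs in the driver process.
numbers = sc.parallelize(range(1, 1001), numSlices=4)

def square(x: int) -> int:
    # This function is serialized and shipped to the executors; it runs
    # there, not in the driver, and only once an action forces evaluation.
    return x * x

squares = numbers.map(square)                 # lazy: nothing has run yet
total = squares.reduce(lambda a, b: a + b)    # action: executors do the work
print(total)                                  # result comes back to the driver

spark.stop()
```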
[18:02:09] ottomata: I'd prefer not to, as I have an R-session that's been running for two days and I'm waiting to finish [18:02:21] hm [18:02:59] I can pick this up again over the weekend if the problem persists [18:06:33] Nettrom when you say in my SSH terminal, do you mean the jupyter terminal? or an actual login terminal? (Just double checking, i think you mean the latter) [18:06:40] the latter [18:06:56] hm [18:09:16] Nettrom: is it interfering with your running R script? [18:10:59] ottomata: the one that's been running for two days? difficult to say, those models sometimes do take two days to finish [18:13:33] Nettrom: that is strange, i can't see why you'd get that [18:13:48] none of your processes are opening too many files [18:13:53] oh wait, maybe [18:18:03] yeah [18:18:21] your main jupyter server process is using quite a few files [18:18:21] 1476 [18:18:31] the process soft limit is 4096 though [18:18:46] i'm not sure how that interacts with the default ulimit (i guess for a shell) which says 1028 [18:18:47] but [18:18:53] 1024* [18:19:02] it should be the process limit that is enforced [18:19:23] ottomata: iflorez was also having issues with jupyter files the other day I think [18:19:27] hmmmmm [18:19:32] actually the number does seem to increase [18:19:34] slowly [18:19:48] maybe occasionally it is hitting the limit! [18:20:23] yeah, that it occasionally hits the limit sounds about right [18:24:34] Nettrom: are you using old jupyter interface, or jupyterlab? [18:30:11] Nettrom: i dunno what is going on but your notebook server is indeed opening a ton of sockets [18:30:18] looks like a bug with jupyter something [18:30:27] some stuff I found seems to indicate maybe it is jupyterlab [18:30:29] but not sure [18:34:17] occasionally if it opens too many, e.g.
> 4096, uou'll get that error [18:55:58] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Add editors per country data to AQS API (geoeditors) - https://phabricator.wikimedia.org/T238365 (10Nuria) 05Open→03Resolved [18:56:01] 10Analytics, 10Analytics-Wikistats: Wikistats 2.0: Add statistics for the geographical origin of the contributors - https://phabricator.wikimedia.org/T188859 (10Nuria) [19:09:34] (03PS3) 10Jenniferwang: Add SpecialMuteSubmit schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) [19:50:17] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 (10Nuria) 05Open→03Resolved [19:58:05] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10Nuria) piwik seems to be reading this well and i can see files on nuria@matomo1002:/usr/share/matomo/misc, closing [19:58:13] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Use MaxMind DB in piwik geo-location - https://phabricator.wikimedia.org/T213741 (10Nuria) 05Open→03Resolved [20:02:07] (03CR) 10Nuria: [C: 03+2] Add SpecialMuteSubmit schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) (owner: 10Jenniferwang) [20:02:11] (03CR) 10Nuria: [V: 03+2 C: 03+2] Add SpecialMuteSubmit schema to EventLogging whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) (owner: 10Jenniferwang) [20:03:00] (03CR) 10Nuria: [V: 03+2 C: 03+1] "Sorry, +1, just space issue needs to be corrected." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628235 (https://phabricator.wikimedia.org/T262499) (owner: 10Jenniferwang) [20:14:50] 10Analytics: Retrofit event pipeline with bot detection code - https://phabricator.wikimedia.org/T263286 (10Nuria) [20:15:01] 10Analytics: Retrofit event pipeline with bot detection code - https://phabricator.wikimedia.org/T263286 (10Nuria) [20:15:33] 10Analytics, 10Product-Analytics, 10Platform Team Initiatives (Modern Event Platform (TEC2)): Retrofit event pipeline with bot detection code - https://phabricator.wikimedia.org/T263286 (10Nuria) [20:17:38] 10Analytics, 10Pageviews-API: Pageviews for "Special:Contributions/USERNAME" not working: "Error querying Pageviews API - Not found" - https://phabricator.wikimedia.org/T244639 (10Nuria) 05Open→03Declined a:05Nuria→03None [20:22:46] 10Analytics, 10Pageviews-API: REST API pageviews won't fetch / incorrectly fetching using URL - https://phabricator.wikimedia.org/T262742 (10Nuria) Closing, working fine as long as title matches case wise and project is expressed like es.wikipedia .. etc [20:22:53] 10Analytics, 10Pageviews-API: REST API pageviews won't fetch / incorrectly fetching using URL - https://phabricator.wikimedia.org/T262742 (10Nuria) 05Open→03Resolved [20:47:11] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Epic: Add data quality alarm for mobile-app data - https://phabricator.wikimedia.org/T257692 (10Nuria) Will run some experiments on whether entropy per os for mobile-apps requests is a good timeseries, maybe will try as well entropy per access_method... 
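Back on the stat1008 "Too many open files" thread: the figures quoted earlier (1476 open files, a per-process soft limit of 4096, the shell default of 1024) are RLIMIT_NOFILE values, and it is the soft limit of the offending process that produces the EMFILE error once it is hit. A Linux-only sketch of how one might watch a process creep toward its limit:

```python
import os
import resource

# Limits are per process: the soft limit is what actually triggers
# "Too many open files" (EMFILE); the hard limit is the ceiling the
# process could raise its soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
open_fds = len(os.listdir(f"/proc/{os.getpid()}/fd"))  # Linux-only

print(f"{open_fds} fds open, soft limit {soft}, hard limit {hard}")
if open_fds > 0.9 * soft:
    print("close to the limit: expect 'Too many open files' soon")

# For someone else's notebook server, count /proc/<pid>/fd the same way
# (as root) and read that process's limits from /proc/<pid>/limits.
```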
[20:52:17] 10Analytics, 10Operations, 10Traffic, 10netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (10CDanis) [21:16:11] 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10Milimetric) > Requests bearing the X-Wikimedia-Debug header passthrough the caches but they endup in varnishkafka and... [21:28:59] (03PS1) 10Ottomata: [WIP] Add option to use Wikimedia EventStreamConfig to get kafka topics to ingest [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/628447 (https://phabricator.wikimedia.org/T251609) [22:30:55] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Operations, 10Wikimedia-production-error: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount - https://phabricator.wikimedia.org/T263132 (10jeena) Various jobenqueue errors happened today in the past 6 hours with spikes of 1... [22:34:34] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Operations, 10Wikimedia-production-error: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount - https://phabricator.wikimedia.org/T263132 (10thcipriani) p:05High→03Unbreak! >>! In T263132#6475784, @jeena wrote: > Various... [23:15:33] 10Analytics-Radar, 10Release-Engineering-Team, 10observability, 10serviceops, 10User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (10Krinkle) Based on what I've seen in the past, I believe local testing or bulk testing is generally done directly towa...