[05:22:18] (PS4) Legoktm: Update for Buster, refresh packaging [analytics/udplog] - https://gerrit.wikimedia.org/r/668451 (owner: Majavah)
[05:31:32] (PS5) Legoktm: Update for Buster, refresh packaging [analytics/udplog] - https://gerrit.wikimedia.org/r/668451 (owner: Majavah)
[05:34:53] (PS1) Legoktm: Delete gbp.conf, use default options [analytics/udplog] - https://gerrit.wikimedia.org/r/668597
[05:36:24] (CR) Legoktm: [C: +2] "I added Moritz's changelog entry for completeness and then dropped "debhelper" from Build-Depends since it's implied from debhelper-compat" [analytics/udplog] - https://gerrit.wikimedia.org/r/668451 (owner: Majavah)
[05:36:32] (Merged) jenkins-bot: Update for Buster, refresh packaging [analytics/udplog] - https://gerrit.wikimedia.org/r/668451 (owner: Majavah)
[05:37:27] (PS2) Legoktm: Delete d/gbp.conf and d/files, use default options [analytics/udplog] - https://gerrit.wikimedia.org/r/668597
[05:37:29] (CR) Legoktm: [C: +2] Delete d/gbp.conf and d/files, use default options [analytics/udplog] - https://gerrit.wikimedia.org/r/668597 (owner: Legoktm)
[05:37:34] (Merged) jenkins-bot: Delete d/gbp.conf and d/files, use default options [analytics/udplog] - https://gerrit.wikimedia.org/r/668597 (owner: Legoktm)
[06:09:57] (PS1) Legoktm: Fix packaging [analytics/udplog] - https://gerrit.wikimedia.org/r/668599
[06:11:28] (CR) Legoktm: [C: +2] Fix packaging [analytics/udplog] - https://gerrit.wikimedia.org/r/668599 (owner: Legoktm)
[06:11:36] (Merged) jenkins-bot: Fix packaging [analytics/udplog] - https://gerrit.wikimedia.org/r/668599 (owner: Legoktm)
[06:19:50] (CR) Legoktm: Update for Buster, refresh packaging (2 comments) [analytics/udplog] - https://gerrit.wikimedia.org/r/668451 (owner: Majavah)
[07:00:38] (CR) Joal: [C: +2] "All good :) Thanks for the patches - merge when you wish" [analytics/refinery] - https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: Mforns)
[07:00:53] good morning
[07:01:13] !log stop hadoop daemons on analytics1066 - disk errors on /dev/sdb after reimage
[07:01:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:13:42] !log add analytics1066 back with /dev/sdb removed
[07:13:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:18:33] it was already not added, weird I don't see a task for a broken disk
[07:21:27] joal: bonjour :) ok if I run the systemd timer to drop the druid public datasource?
[07:21:32] to see how it goes
[07:22:51] !log drain + reimage analytics107[0-1] to debian buster
[07:22:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:42:59] (PS2) Lex Nasser: Fix and optimize Hive query and change field names in properties file for top-per-country job [analytics/refinery] - https://gerrit.wikimedia.org/r/668236 (https://phabricator.wikimedia.org/T207171)
[07:45:12] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['analytics1070.eqiad.wmnet', 'analytics1071.eqiad.wmnet'] ` The log can be found in...
[08:18:10] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1071.eqiad.wmnet', 'analytics1070.eqiad.wmnet'] ` and were **ALL** successful.
[08:32:08] !log drain + reimage an-worker107[8,9] to Debian Buster (one Journal node included)
[08:32:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:42:35] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1078.eqiad.wmnet', 'an-worker1079.eqiad.wmnet'] ` The log can be found in...
[09:45:16] Analytics-Clusters, Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1079.eqiad.wmnet', 'an-worker1078.eqiad.wmnet'] ` and were **ALL** successful.
[09:46:42] officially more hadoop worker nodes on buster than on stretch :)
[09:46:51] 41 vs 37
[09:53:24] very weird, the namenode failed over
[09:54:37] ah snap I think it spent too much time in GC
[09:55:24] probably time to bump the heap size
[10:06:54] !log failover HDFS Namenode from 1002 to 1001 (high GC pauses triggered the HDFS zkfc daemon on 1001 and the failover to 1002)
[10:06:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:19:37] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/668659
[10:20:23] !log force run of refinery-druid-drop-public-snapshots to check Druid public's performances
[10:20:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:24:36] datasource dropped, so far no sign of troubles, metrics look good
[10:24:40] wikistats is ok as well
[10:25:31] joal: I think we did it!!!
[10:25:32] * elukey dances
[10:56:10] Analytics, Analytics-Kanban, Patch-For-Review: Dropping data from druid takes down aqs hosts - part 2 - https://phabricator.wikimedia.org/T270173 (elukey) Forced a data drop on Druid public and nothing really happened, the problem seems gone!
[11:17:26] * elukey lunch!
[12:22:38] Hi elukey - sorry I've been taking disconnect time this morning - No problem at datasource drop feels like a huge win :) You rock elukey :)
[13:11:08] (CR) Joal: "Comment about comment :)" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/668236 (https://phabricator.wikimedia.org/T207171) (owner: Lex Nasser)
[13:23:04] (CR) Mforns: [V: +2 C: +2] "Thanks for the reviewwww, Joal!" [analytics/refinery] - https://gerrit.wikimedia.org/r/664885 (https://phabricator.wikimedia.org/T273116) (owner: Mforns)
[13:31:00] joal: when you have a moment - https://gerrit.wikimedia.org/r/c/operations/puppet/+/668659/
[13:31:49] the beast needs to be fed :D
[13:32:02] uhuh
[13:32:49] well - let's do it :)
[13:33:19] elukey: I also wish we work toward reducing file-numbers
[13:34:09] joal: ah yes I agree :)
[13:36:31] !log roll restart HDFS Namenodes for the Hadoop cluster to pick up new Xmx settings (https://gerrit.wikimedia.org/r/c/operations/puppet/+/668659)
[13:36:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:00:18] ok the failover from 1002 to 1001 didn't work, weird
[14:00:20] the cookbook failed
[14:00:30] meh?
[14:03:23] joal: second attempt worked
[14:05:56] that is not super great, but I think that the first failed since an-master1001 became unhealthy when doing the failover
[14:06:20] I waited a lot of minutes, maybe it was too soon, from now on the failovers might need more than 5 mins of wait time
[14:07:10] ok I am going to wait 10/15 mins before restarting the NN on 1002
[14:07:13] to complete the procedure
[14:20:03] ok restarted 1002
[14:23:43] ottomata: nice reduction with the R package removal!
[14:24:05] (sorry for the reviews, didn't have the time to review them in depth and I didn't want to slow you down :( )
[14:25:18] s'ok i'm mostly adding you as reviewers for reference and/or objections
[14:28:43] ack, I feel super ignorant about the conda stuff, I'll have to review it sooner or later
[14:29:02] when you have a moment next week I'd like to pick your brain on https://github.com/criteo/tf-yarn
[14:29:25] it uses the cluster-pack/conda-pack thing, I am wondering if we could use it after we add GPU labels in yarn
[14:29:39] (maybe adapting it to the conda work that you did)
[14:29:52] I'll also ping Fabian
[14:30:19] (I hate all those GPUs gathering dust :D)
[14:35:03] elukey: ya i read a bunch of that code
[14:35:11] it is much more flexible and cool than what i wrote
[14:35:21] it is able to detect if local packages have changed and re-upload to yarn
[14:35:22] but
[14:35:30] it is missing some things we need (probably could do a pull request)
[14:35:44] and ultimately, for the conda pack stuff, it doesn't do much other than what I wrote
[14:52:03] ottomata: hi! do you have a moment for a chat about session length?
[15:02:07] ottomata: yep yep I asked since I was wondering if your code could fit in, and it seems so, good :)
[15:07:27] !log drain + reimage analytics1073 and an-worker1086 to Debian Buster
[15:07:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:12:43] joal: is it just me or did hdfs dfs -cat / hdfs dfs -text used to work for parquet files and doesn't anymore?
[15:12:55] milimetric: nope, never worked :)
[15:13:06] milimetric: parquet data is not to be visible in text
[15:13:19] I could've sworn it did :) ok, just me, I thought we had hooked something up to read it
[15:13:29] maybe that was avro
[15:19:27] I'm back in fdans
[15:20:03] mforns: I think my internet is screwed
[15:20:30] oops, ok, no worries
[15:35:47] Hi team, g'day
[15:39:05] !log rebalance kafka partitions for webrequest_upload partition 10
[15:39:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:39:32] yoohoo
[15:39:52] elukey: razzi yall doing labsdb today?
[15:40:16] That's the plan!
[15:40:24] I'm ready to start whenever
[15:49:12] razzi: I am here to help/assist if you need :)
[15:49:43] ok cool! I'll get started
[15:50:16] Steps are at https://phabricator.wikimedia.org/T269211#6883946, here I go...
[15:50:32] :)
[15:50:44] razzi: one nit - there is still a reference to "Analytics vlan" in the steps, remember that it should be cloud-etc..
[15:52:19] Updated! Thanks elukey
[15:52:41] !log stop mariadb on labsdb1012
[15:52:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:53:24] razzi: also ping people in #databases so they know :)
[15:54:06] and if possible also !log in #operations for visibility
[15:55:37] Ok, messaged to #wikimedia-databases, will log here and in operations
[15:56:44] super
[16:00:18] razzi: one qs - in netbox, is Device: labsdb1012 correct? Or does it need to be clouddb1021?
[16:01:28] elukey: good question, clouddb1021 makes more sense I think since by then I'll have already renamed the dns name to clouddb1021
[16:01:50] it is not clear in the docs but yeah clouddb looks more reasonable, I'll dig into it
[16:03:38] yes yes I think we need to use clouddb1021 in there
[16:05:58] ok cool, thanks for the catch
[16:08:35] !log sudo cookbook sre.hosts.decommission labsdb1012.eqiad.wmnet -t T269211
[16:08:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:08:38] T269211: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211
[16:18:58] nice cookbook completed :)
[16:19:44] klausman: looks like you have a commit awaiting puppet-merge: ml-ctrl: Add dummy keys for ML k8s control plane
[16:20:10] will merge in a New York minute
[16:20:11] Seems like puppet-merge got smarter; it asked if I wanted to merge those, then asked if I wanted to merge mine, rather than making me merge all at once
[16:20:43] Either is fine by me, just lmk :)
[16:21:15] Maybe because those are in secrets module actually, and mine were public puppet
[16:21:28] Yeah, otherwise it'd rewrite history
[16:28:48] !log rename https://netbox.wikimedia.org/ipam/ip-addresses/734/ DNS name from labsdb1012.mgmt.eqiad.wmnet to clouddb1021.mgmt.eqiad.wmnet
[16:28:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:30:35] !log delete non-mgmt interfaces for labsdb1012 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/
[16:30:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:35:16] elukey: for the form at https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/, I only see the old device name labsdb1012, should I rename it to clouddb1021 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/?
[16:40:15] razzi: checking sorry
[16:44:03] so https://netbox.wikimedia.org/search/?q=clouddb1021&obj_type= looks definitely strange, the parent is labsdb1012
[16:44:26] hm ok
[16:46:37] razzi: ah yes, see the "Edit the device page with the new name etc.."
[16:46:40] https://netbox.wikimedia.org/dcim/devices/2078/
[16:46:53] it is planned, but not clouddb1021
[16:46:58] (still carrying the old name)
[16:47:04] ok cool, missed that step!
[16:47:24] in theory after this edit you should find it
[16:47:27] (in the dropdown)
[16:47:41] !log edit https://netbox.wikimedia.org/dcim/devices/2078/ device name from labsdb1012 to clouddb1021
[16:47:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:48:12] I may be spamming with all my !log-ing, but better to log too much than too little I think
[16:48:54] +1 razzi :)
[16:49:21] razzi: yep yep! If you want you can just !log the macro operation in #operations (like !log rename blabla) and then be spammy in the task or in here
[16:49:43] ok gotcha
[16:49:44] the important bit in #operations is that people can see if an alert matches with some ongoing ops
[16:50:39] ottomata: hey, any chance you'll be able to take another pass over https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/667948 today?
[16:50:56] cdanis: yes can look at it today!
[16:51:00] ty for reminder
[16:51:03] thank you!
[16:52:32] !log run script at https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/
[16:52:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:54:44] !log sudo cookbook sre.dns.netbox -t T269211 "Reimage and rename labsdb1012 to clouddb1021"
[16:54:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:54:53] T269211: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211
[16:56:23] merged cdanis
[16:56:30] oh awesome
[16:56:39] u can do helmfile stuff?
[16:56:47] yep!
[17:04:39] razzi: how is it going with the DNS? :)
[17:04:47] ah already done, good
[17:05:06] yep, now working on "insetup" puppet patch
[17:05:17] (PS1) Phuedx: WIP: Add properties to UniversalLanguageSelector schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668743 (https://phabricator.wikimedia.org/T275766)
[17:05:47] perfect, let's see then if the reimage works :)
[17:07:40] Yup!
[17:07:53] !log sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
[17:07:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:07:56] T269211: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211
[17:26:28] razzi: how is the reimage going?
[17:26:51] Seeing "Still waiting for reboot after 10.0 minutes", hopefully that's normal?
[17:27:10] I am in the console and I don't see much ongoing
[17:27:24] did it get to debian install?
[17:27:35] or is it the reboot before it?
[17:28:48] Here's the output thus far:
[17:28:48] 17:09:08 | clouddb1021.eqiad.wmnet | Removed from Puppet
[17:28:48] 17:09:08 | clouddb1021.eqiad.wmnet | WARNING: Unable to remove from Debmonitor, got: 404
[17:28:48] 17:09:08 | clouddb1021.eqiad.wmnet | Set Boot Device to pxe
[17:28:48] 17:09:09 | clouddb1021.eqiad.wmnet | Current power status is off, powering on
[17:28:49] 17:09:09 | clouddb1021.eqiad.wmnet | Chassis Power Control: Up/On
[17:28:49] 17:15:10 | clouddb1021.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
[17:28:50] 17:22:40 | clouddb1021.eqiad.wmnet | Still waiting for reboot after 10.0 minutes
[17:29:01] ottomata: hellooo, do you have some time (10 mins) to discuss session length?
[17:30:21] razzi: strange
[17:30:47] mforns: yes now is perfect
[17:30:58] ok! bc?
[17:31:01] k
[17:31:43] razzi: tried to reboot it from the console (vsp -> power reset)
[17:32:12] nope I don't see anything
[17:34:52] it is strange since the mgmt interface seems working
[17:34:59] (PS1) Eric Gardner: Update schema to 1.3.0 and add new "image" mediatype option [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668748
[17:35:34] (CR) jerkins-bot: [V: -1] Update schema to 1.3.0 and add new "image" mediatype option [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668748 (owner: Eric Gardner)
[17:36:57] (PS2) Eric Gardner: Update schema to 1.3.0 and add new "image" mediatype option [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668748
[17:37:11] razzi: when the decom script ran, did it kick off a homer script to update the switch config?
[17:37:24] elukey: yes
[17:37:41] razzi: and I guess it removed all configs right?
[17:39:23] I see in your procedure that the homer stuff is updated at the bottom, but now I don't see anything on the switch related to clouddb1021
[17:40:12] ah also big surprise, the new clouddb nodes are in the private vlan
[17:40:21] not in the cloud one
[17:41:03] ah snap the same as https://phabricator.wikimedia.org/T260441
[17:42:27] razzi: ok so the situation is a bit complicated, we need to ping Brooke to ask what is the right VLAN, even if I suspect private
[17:42:52] ok, I found the original homer output in case that's useful
[17:43:06] in case, we'll need to remove the interfaces (except mgmt), re-run the script to provision but in private, and then run again netbox to update the dns
[17:43:45] razzi: can you add that into a paste?
[17:47:03] yep, one moment
[17:47:14] asking Brooke in the meantime
[18:16:35] !log delete non-mgmt interface for clouddb1021
[18:16:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:17:40] !log re-run interface_automation.ProvisionServerNetwork with private vlan
[18:17:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:18:44] !log sudo cookbook sre.dns.netbox -t T269211 "Move clouddb1021 to private vlan"
[18:18:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:18:48] T269211: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211
[18:30:57] !log run again sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
[18:31:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:31:00] T269211: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211
[18:53:30] * elukey afk!
[18:53:35] have a good weekend folks :)
[18:55:23] Alright! Reimage worked, clouddb1021 is insetup. Going afk for lunch
[19:37:42] (PS3) Lex Nasser: Fix and optimize Hive query and change field names in properties file for top-per-country job [analytics/refinery] - https://gerrit.wikimedia.org/r/668236 (https://phabricator.wikimedia.org/T207171)
[20:26:49] (PS1) Ottomata: Migrate legacy EL schemas EditAttemptStep and VisualEditorFeatureUse [schemas/event/secondary] - https://gerrit.wikimedia.org/r/668773 (https://phabricator.wikimedia.org/T267343)
[20:36:21] joal: you around?
[21:39:50] fkaelin: got a sec for a java world brain bounce?
[21:43:09] Is it ok to let spark jobs run over the weekend? I am testing a series of jobs that i'd expect to take a couple of days (total) to complete. I configured each job according to the "regular size" job spec. They'll run one at a time.
[21:43:26] ya sure sure
[21:43:36] ottomata awesome, thanks!
[21:43:40] although...i don't remember if there was an issue with kerberos tickets expiring anymore
[21:43:52] but there's def no harm in it
[21:44:07] gmodena: 'regular' meaning you are using wmfdata?
[21:44:34] as long as it does not cause issues for you, i can recover from a ticket expiring (I'll check in during the day)
[21:44:43] ok yeah no issues on the cluster
[21:45:01] ottomata yes, I use wmfdata's config to init SparkSession
[21:45:28] there was something in wmfdata about timing out sessions too, but maybe that only happens with the .run function
[21:45:33] can't recall atm
[21:45:41] ack
[21:46:16] i'm using only configs, so hopefully I'm good. And it's a test, no biggie if it fails :)
[21:46:23] k :)
[21:48:03] i recently discovered yarn.wikimedia.org
[21:48:10] <3 your systems.
[22:08:37] yay!