[06:32:52] 10DBA: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3830049 (10Marostegui)
[06:44:39] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3830063 (10Marostegui)
[06:44:59] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3754580 (10Marostegui)
[06:46:37] 10DBA, 10Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3830065 (10Marostegui) p:05Triage>03Normal
[06:46:55] 10DBA, 10Operations, 10Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3830049 (10Marostegui)
[07:10:35] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3830085 (10Marostegui) a:03Cmjohnson @Cmjohnson I have been unable to identify which of the PSUs is failing; the idrac console isn't recording which one it is (sometimes i...
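Marostegui notes above that the iDRAC console doesn't say which PSU is failing. One common fallback (not mentioned in the log, so purely a suggestion) is to read the IPMI System Event Log, e.g. via `ipmitool sel elist`, and filter for power-supply sensor events. A minimal sketch, assuming the usual pipe-separated `sel elist` output; the sample data is illustrative, not real db1055 output:

```python
def psu_events(sel_output: str):
    """Filter IPMI SEL lines (e.g. from `ipmitool sel elist`) down to
    power-supply events. `sel elist` fields are pipe-separated:
    id | date | time | sensor name | description | direction."""
    events = []
    for line in sel_output.splitlines():
        if "Power Supply" in line:
            events.append([f.strip() for f in line.split("|")])
    return events

# Hypothetical SEL excerpt (one PSU failure, one unrelated event):
sample = """\
 1 | 12/12/2017 | 06:30:01 | Power Supply PS2 Status | Failure detected | Asserted
 2 | 12/12/2017 | 06:31:12 | Temperature Inlet Temp | Upper Non-critical | Asserted
"""

for ev in psu_events(sample):
    print(ev[3], "-", ev[4])  # prints: Power Supply PS2 Status - Failure detected
```

The sensor name column usually identifies the individual PSU (PS1/PS2), which is exactly the detail the iDRAC summary view was hiding here.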
[07:20:11] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3830100 (10Marostegui)
[08:27:32] I have sent wmf-mysql80_8.0.3-rc-1_amd64.deb to install1002:/home/jynus/stretch
[08:29:06] just the executable is 600MB
[09:11:30] 91% ETA 1:32:50
[09:23:19] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3830312 (10Marostegui)
[09:23:41] 10DBA, 10Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#3830317 (10Marostegui)
[09:23:43] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3830316 (10Marostegui)
[09:23:46] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3803575 (10Marostegui) 05Open>03Resolved
[09:45:42] marostegui, jynus: are you reimaging servers ATM? I'm about to refresh the netboot image for stretch (there was a point release for stretch last weekend)
[09:46:40] not me
[09:46:58] let me check in case marostegui is busy
[09:47:52] there is a wmf-auto-reimage -c -- mw1259.eqiad.wmnet
[09:48:09] (but not running) on a screen
[09:48:20] but in any case that should be jessie
[09:48:22] Nope, I am not doing anything :)
[09:48:29] ah, that's an older session I think, for some reason that hung on Thursday
[09:48:32] I will check anyway
[09:48:38] I did that :-)
[09:48:39] in case I find something else
[09:48:42] ah, ok
[09:48:46] so nope
[09:48:49] so mw1259 is fine
[09:48:57] k, thanks, updating in 1-2 mins
[09:52:19] updated
[10:54:21] 19:09:59 100%
[10:54:57] \o/
[10:56:24] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3830446 (10Marostegui) s4 eqiad hosts: [] labsdb1001.eqiad.wmnet (broken - will not be done) [] labsdb1003.eqiad.wmnet [] db1102.eq...
[10:56:41] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3830447 (10Marostegui)
[11:14:16] marostegui: I am touching s5/s8, I am going to assume you are not touching those?
[11:14:55] yep, not touching any of those
[11:14:57] only s4 and s7
[11:15:16] ok, will ping if I finish with those
[11:15:24] cool!
[12:16:50] I am going to start touching s1, probably only codfw
[13:14:30] sorry - I was having lunch
[13:14:37] sure, not touching s1 at all myself
[13:45:13] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3830929 (10Marostegui)
[14:01:58] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3831001 (10Marostegui)
[14:44:04] dbstore1002 will have replicated almost 1 day at this point without errors
[14:44:26] nice
[14:44:30] it was failing almost every 24h
[14:44:34] so that looks promising
[14:49:46] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3831142 (10Cmjohnson) @Marostegui Replaced the PSU and both are now redundant Date/Time: 12/12/2017 14:43:15 Source: system Severity: Critical Description: Power supply...
[14:51:56] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Power supply error on db1055 - https://phabricator.wikimedia.org/T182653#3831148 (10Marostegui) 05Open>03Resolved That was fast!
Thanks a lot ``` RECOVERY - IPMI Sensor Status on db1055 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK ```
[15:00:48] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3831175 (10Cmjohnson) p:05Normal>03Low
[15:04:58] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3831210 (10Marostegui)
[15:35:24] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3599592 (10Imarlier) @aaron - see note from Jaime above, he's waiting on answers fro...
[15:49:48] db2072 was killed because it went over the timeout :-(
[15:50:00] I think the older packages had a non-infinity timeout
[15:51:58] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831379 (10Cmjohnson)
[15:57:11] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831395 (10Cmjohnson) assigning to @Marostegui for installs
[16:04:47] 10DBA, 10Patch-For-Review: Checksum data on s7 - https://phabricator.wikimedia.org/T163190#3831437 (10Marostegui)
[16:16:02] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3769627 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
[16:22:28] I am going to go next with the passive phabricator eqiad node
[16:22:45] cool!
[16:24:55] are you making 1111 and co jessie or stretch?
[16:25:18] jessie
[16:34:44] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831544 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ```
[16:34:53] uuuh?
[16:34:55] let's see
[16:35:57] weird.. the first puppet run works fine, let's try again
[16:38:11] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831568 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
[16:42:51] marostegui: the first one failed for:
[16:42:51] Failed to reset failed state of unit puppet.service: Unit puppet.service is not loaded.
[16:43:27] But I ran puppet manually after that and it worked fine
[16:44:14] marostegui: option pxelinux.pathprefix "trusty-installer/";
[16:44:15] ???
[16:44:24] uh??
[16:44:38] * volans updating local copy
[16:45:03] seems the same
[16:45:11] in modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
[16:45:19] maybe chris got confused
[16:45:37] errr volans are you looking at db1111 or db1011?
[16:45:38] :)
[16:45:52] this is what I see
[16:45:55] ah right
[16:45:55] host db1111 {
[16:45:55] hardware ethernet 80:18:44:DF:D4:D0;
[16:45:55] fixed-address db1111.eqiad.wmnet;
[16:45:56] }
[16:45:56] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[16:46:04] sorry my bad
[16:46:43] marostegui: could it be bad options on first install, or not part of the regex?
[16:46:56] if you want jessies, be careful that soon we'll switch to stretch by default: T182215
[16:46:56] T182215: install_server: switch to stretch as default install image - https://phabricator.wikimedia.org/T182215
[16:46:59] I have not checked, just throwing things into the wind
[16:47:09] volans: yep, I was aware
[16:47:24] jynus: no, the install was fine, what apparently failed was the first puppet run, but doing it manually worked
[16:47:27] we will see this time
[16:47:30] it is now waiting for it
[16:47:33] so we will know soon :)
[16:47:34] also, if a reimage fails AFTER d-i, it can be resumed from there
[16:48:11] in the sense that the various options allow skipping specific parts
[16:48:19] Ah, I didn't know that :)
[16:49:41] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831598 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ```
[16:49:44] yeah, failed again
[16:50:29] Failed to reset failed state of unit puppet.service: Unit puppet.service is not loaded.
[16:50:44] yeah, I am running puppet again manually on the host to see what we get
[16:50:59] that's not what failed
[16:51:17] it's the systemctl reset-failed puppet.service
[16:51:18] that failed
[16:51:34] jessie or stretch?
[16:51:38] jessie
[16:56:21] nope
[16:56:26] Release: 9.3
[16:56:27] Codename: stretch
[16:57:05] mmm
[16:57:14] moritzm: did you change the default image today by any chance? it seems that we got a stretch here without specifying it in modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200
[16:57:28] same for kafka1023 with elukey
[16:57:42] <moritzm 10:45> marostegui, jynus: are you reimaging servers ATM?
I'm about to refresh the netboot image for stretch (there was a point release for stretch last weekend)
[16:57:48] that is what he said today
[16:57:59] yeah that I knew, refresh for the point release
[16:58:09] unsure if anything else changed though
[16:59:51] I don't see anything on gerrit from moritzm today changing anything
[17:00:13] there was a puppet4 client migration ongoing, wasn't there?
[17:00:34] and some kind of wikimedia/wikimedia-backports change
[17:01:10] but that shouldn't mess with the installation distro, no?
[17:01:43] no, at most it could explain the failure if there isn't a puppet service to reset-fail anymore
[17:02:07] but with that in mind every stretch reimage should fail, while it was working
[17:02:25] maybe we changed puppet.conf and that affected this
[17:02:37] at the moment I think the two things are unrelated
[17:02:44] 1) we got stretch instead of jessie
[17:03:16] 2) the reimage failed because the "systemctl reset-failed puppet.service" failed with "Unit puppet.service is not loaded"
[17:03:34] (2) I can fix in the reimage to do it only if it's loaded indeed
[17:03:53] (1) I dunno
[17:04:10] 2) is a valid fix anyway, no?
[17:04:24] volans: no, I only ran the scripts to refresh the netboot images for jessie and stretch
[17:04:29] yep, looking at it, although I'd like to know what changed :D
[17:04:55] the choice of distros is handled entirely by the install_server puppet module
[17:06:11] then that is weird because it got stretch
[17:08:59] I cannot see anything weird on puppet or on install1002
[17:11:07] and dns works as intended
[17:11:17] volans: you mentioned that also happened to luca?
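The `reset-failed` guard volans proposes above ("do it only if it's loaded") could look roughly like the following. This is a hypothetical sketch, not the actual wmf-auto-reimage code; the injectable `run` parameter exists only to make the logic testable without a live systemd:

```python
import subprocess

def reset_failed_if_loaded(unit: str, run=subprocess.run) -> bool:
    """Run `systemctl reset-failed <unit>` only when the unit is actually
    in the failed state, avoiding the "Unit ... is not loaded" error seen
    in the reimage log. Returns True if a reset was performed."""
    probe = run(["systemctl", "is-failed", unit],
                capture_output=True, text=True)
    # `is-failed` prints the unit's state; anything other than "failed"
    # (e.g. "inactive", or no such unit) means there is nothing to reset.
    if probe.stdout.strip() != "failed":
        return False
    run(["systemctl", "reset-failed", unit], check=True)
    return True
```

Checking the printed state instead of calling `reset-failed` unconditionally makes the step idempotent, which matters because the reimage script may run it on hosts where the puppet unit was never loaded in the first place.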
[17:11:57] # cat /srv/tftpboot/jessie-installer/version.info
[17:11:58] Debian version: 9 (stretch)
[17:11:58] Installer build: 20170615+deb9u2+b1
[17:12:08] seems that something's wrong over there
[17:12:18] :/
[17:12:25] :-)
[17:12:32] moritzm: ^
[17:12:32] but pxelinux.cfg/boot.txt refers to jessie
[17:12:41] * volans confused :D
[17:13:25] in the stretch-installer grep -rins jessie returns nothing, and that's correct
[17:13:36] while in the jessie-installer both greps find stuff
[17:15:26] it is all a trick to upgrade to stretch earlier than intended
[17:15:30] haha
[17:15:40] upgrade by error :D
[17:16:54] volans: hmm, not sure. I just ran the same command as with all previous jessie point updates as well: /home/faidon/update-netboot.sh on puppetmaster1001
[17:17:14] ah, found the error I think
[17:17:27] it's paravoid's fault then, it's in his home :-P
[17:17:31] should be puppetized
[17:17:33] XD
[17:17:33] :D
[17:17:45] line 11 refers to stable, but jessie is now oldstable
[17:18:01] * moritzm wonders why this didn't break for the 8.9 update, but maybe that one didn't have a rebuilt d-i
[17:18:07] yeah rolling names bites
[17:23:50] volans, marostegui, jynus, elukey: I fixed the code name and re-ran the script on puppetmaster1001
[17:24:11] \o/
[17:24:12] thanks a lot
[17:24:19] I will issue a reimage again then :)
[17:24:40] let me run puppet on install1002 first :-)
[17:24:45] thanks moritzm !
[17:24:51] I was about to do it :)
[17:25:48] puppet run completed, should be fine now
[17:25:55] let's go!
[17:26:07] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831686 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1111.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-rei...
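The root cause found above is a classic one: the netboot refresh script referenced the suite alias `stable`, which silently moved from jessie to stretch at release time, so `jessie-installer/` ended up holding stretch bits. A small sanity check over `version.info`, run after every refresh, would have caught the mismatch immediately. A sketch under the assumption that the directory is named `<codename>-installer` and `version.info` has the format shown in the log:

```python
import re

def installer_codename(version_info: str) -> str:
    """Extract the codename from a d-i version.info blob, e.g.
    'Debian version: 9 (stretch)' -> 'stretch'."""
    m = re.search(r"Debian version:\s*\S+\s*\((\w+)\)", version_info)
    if not m:
        raise ValueError("no 'Debian version' line found")
    return m.group(1)

def check_installer(dirname: str, version_info: str) -> None:
    """Fail loudly when <codename>-installer/ contains another release's
    installer, as happened when jessie-installer/ held stretch bits."""
    expected = dirname.removesuffix("-installer")
    actual = installer_codename(version_info)
    if actual != expected:
        raise RuntimeError(
            f"{dirname} contains a {actual} installer, expected {expected}")

# The exact mismatch from the log above:
info = "Debian version: 9 (stretch)\nInstaller build: 20170615+deb9u2+b1\n"
try:
    check_installer("jessie-installer", info)
except RuntimeError as e:
    print(e)  # prints: jessie-installer contains a stretch installer, expected jessie
```

More fundamentally, the fix applied in the log is the right one: scripts that must track a fixed release should name the codename (`jessie`) explicitly and never the rolling aliases `stable`/`oldstable`.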
[17:26:29] IIRC Debian also releases firmware-enriched images these days, we should switch to those, I'll open a task
[17:27:35] do they include the bnx2x too?
[17:27:48] I remember for the 10G I had to add them manually in the past
[17:28:27] Oh the fun of bnx2 and e1000...
[17:30:13] I don't know, needs to be researched, for now I opened https://phabricator.wikimedia.org/T182699
[17:30:32] I only saw this mentioned on debian-devel last week, haven't looked into this myself
[17:30:38] ok
[17:30:45] would be nice
[17:31:38] moritzm: you're the only other one that reimaged today; was mw1260 supposed to be stretch or jessie?
[17:31:46] because I guess you got stretch
[17:33:24] yeah, that was intentional, it used to use jessie and I wanted to migrate to stretch
[17:33:59] so that worked because it was the "correct" stretch
[17:34:04] I assume
[17:34:05] do we have an s7 outage?
[17:34:11] uh?
[17:34:40] there is lag yep
[17:34:59] is it gone?
[17:35:19] yeah, in eqiad yes
[17:35:22] in codfw it is recovering
[17:35:38] normal, it has to take 2x the time
[17:35:46] i know
[17:36:55] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1062&var-port=9104
[17:42:58] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1015 - https://phabricator.wikimedia.org/T173570#3831730 (10Cmjohnson)
[17:44:36] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3831735 (10Cmjohnson)
[17:45:18] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1026 - https://phabricator.wikimedia.org/T174763#3831736 (10Cmjohnson)
[17:45:57] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3831739 (10Cmjohnson)
[17:46:23] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad:
Decommission db1049 - https://phabricator.wikimedia.org/T175264#3831741 (10Cmjohnson)
[17:49:25] 10DBA, 10Operations, 10Phabricator, 10hardware-requests, 10ops-eqiad: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3831747 (10Cmjohnson) All non-interruptible steps have been completed. Still needs wiping/removal from rack
[17:49:39] marostegui: was something heavy running on s7 that could have contributed to it? I assume not on the master or codfw?
[17:49:55] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3831749 (10Cmjohnson)
[17:50:34] jynus: nope, nothing running from my side
[17:50:51] I was hoping for something to explain it
[17:51:13] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3798986 (10Cmjohnson)
[17:51:26] were you guys able to reimage a jessie host?
[17:51:45] I am getting a kernel panic while pxe installing
[17:52:37] (before d-i)
[17:52:48] 10DBA, 10Operations, 10Goal: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179#3831770 (10jcrespo) We believe that since s5 was accidentally migrated to ROW, the lag is improved; so it did on labsdbs despite not having any kind of replication control, unlike production.
[17:57:03] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3831781 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'] ```
[17:57:16] it timed out when trying pxe
[17:57:32] elukey: I just got the kernel panic too
[17:57:47] :-)
[17:57:59] maybe time to leave it for tomorrow
[17:58:02] yeah
[17:58:06] I was thinking about that
[17:58:09] and research the issues
[17:58:20] downtime/disable alerts just in case
[17:58:30] * elukey cries in a corner
[17:58:32] https://phabricator.wikimedia.org/P6451
[17:58:41] moritzm: --^
[17:58:51] yeah, got it again too
[18:01:31] marostegui: looking better
[18:01:46] it only affected db1034 and db1039
[18:02:14] which I am going to assume are not pooled/pooled with 0 weight
[18:02:29] that is correct
[18:02:33] so actually, no issue
[18:02:45] codfw got behind, but at this point this is "normal"
[18:03:16] I only got worried because last time that happened, codfw complained first on master breakage
[18:03:33] all other hosts worked nicely, we should just decomm the older servers
[18:03:57] proof: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=7&fullscreen&orgId=1&from=1513097990213&to=1513101633587
[18:04:18] vs: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=7&fullscreen&orgId=1&from=1513097990213&to=1513101633587&var-dc=codfw%20prometheus%2Fops
[18:04:34] marostegui, elukey: hmm, not sure. 3.16.51-2 is the new kernel in the jessie 8.9 update, maybe something has regressed there
[18:04:50] 3.16.51-2?
[18:05:02] I thought there was something newer?
[18:05:47] oh, we have 4.9 on jessies
[18:06:18] the jessie installer is first using the default kernel shipped by jessie (3.16), the switch to 4.9 only occurs after the initial installation
[18:06:28] yes, I understand
[18:06:41] I wonder if it is worth updating the installer at this time
[18:06:59] I mean the already done upgrade
[18:07:31] it's not so simple to update the jessie installer to use 4.9 from the start unfortunately
[18:07:43] no no
[18:07:45] I meant
[18:07:52] not to update the installer, period
[18:07:59] and update packages afterwards
[18:08:21] aka install with 8.8
[18:08:42] yeah, but we need to upgrade the installer, it's essentially broken after every jessie or stretch point release (until we implement https://phabricator.wikimedia.org/T182699)
[18:08:50] ah, ok
[18:09:05] :-(
[18:09:08] I only ran the upgrade script since elukey pinged me on IRC with the error earlier today
[18:21:59] marostegui: what was the host that failed with kernel panic?
[18:23:21] (let's check tomorrow :)
[18:28:21] db1111
[18:29:19] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3831865 (10aaron) >>! In T175672#3778177, @jcrespo wrote: > @aaron the proxy is inst...
[18:30:43] just opened https://phabricator.wikimedia.org/T182702
[18:31:24] thanks!
[18:47:41] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Patch-For-Review, 10Performance-Team (Radar): Make apache/maintenance hosts TLS connections to mariadb work - https://phabricator.wikimedia.org/T175672#3831913 (10jcrespo) > A local and foreign replica would do it is installed on both...
[19:39:41] 10DBA, 10Data-Services: Make Dispenser's principle_links table accessible in new Wiki replica cluster - https://phabricator.wikimedia.org/T180636#3832045 (10Dispenser) @jcrespo The current pipeline is: # A bash/python script on ToolForge which makes 275,000 MW API requests and bundles JSON responses in a `.tar...