[04:55:44] 10DBA, 10Operations: move db1114 to s8 - https://phabricator.wikimedia.org/T250224 (10Marostegui) p:05Triage→03High
[05:03:53] 10DBA, 10Gerrit: Investigate Gerrit troubles to reach the MariaDB database - https://phabricator.wikimedia.org/T247591 (10Marostegui) Good to hear! Thank you!
[05:14:57] 10DBA, 10Operations: move db1114 to s8 - https://phabricator.wikimedia.org/T250224 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1114.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202004150514_marostegui_23354.log`.
[05:31:30] 10DBA, 10Operations: move db1114 to s8 - https://phabricator.wikimedia.org/T250224 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1114.eqiad.wmnet'] ` and were **ALL** successful.
[05:45:04] 10DBA, 10Operations, 10Traffic, 10serviceops: Audit and harmonize timeouts across the stack - https://phabricator.wikimedia.org/T250251 (10Joe)
[06:14:57] 10DBA, 10Operations: move db1114 to s8 - https://phabricator.wikimedia.org/T250224 (10Marostegui) I am populating this host from s8's snapshot from yesterday (600GB), already transferred.
[06:37:17] 10DBA: inverse_timestamp column exists in text table, it shouldn't - https://phabricator.wikimedia.org/T250063 (10Marostegui) @Ladsgroup that patch you posted is for the revision table - which btw includes some fields that will be dropped at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/552339 so not sure...
[06:40:34] 10Blocked-on-schema-change, 10Wikimedia-Rdbms, 10Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (10Marostegui)
[06:44:34] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui)
[06:44:48] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui) p:05Triage→03Medium
[06:48:04] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui)
[06:52:52] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui) s6 eqiad [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1005 [] db1139 [] db1131 [] db1125 [] db1113 [] db1098 [] db1096 [] db1093 [] db1088 [] db1085
[07:01:30] s8 looks healthier now :)
[07:03:28] it always does in the morning, it is not an early bird
[07:03:31] I am finishing with the new host
[07:03:36] <3
[07:04:44] I think having some graphs showing at what rate each wiki is making queries to each section would be greatly beneficial; it would allow us to narrow down what the load on s8 is being caused by
[07:05:09] don't think there is anything for that yet
[07:05:50] nope, we don't have that
[07:44:28] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui)
[07:59:48] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui) a:03Marostegui
[07:59:59] addshore: I am about to start pooling the new host :)
[08:14:40] db1114 is now pooled in s8 with weight 1
[08:14:45] Checking errors just in case
[08:14:48] (grants are there)
[08:15:29] reads are going through, so far so good
[08:15:56] db1114 has all alerts disabled FYI
[08:17:11] I just enabled them
[08:17:21] see the commit
[08:17:42] well, didn't check puppet, just icinga :-D
[08:17:46] hehe
[08:18:00] no, I mean they should be enabled on icinga before being pooled
[08:18:26] it was done at the same time and also manually
[08:18:33] so that was it
[08:18:41] I also double checked the host was all green
[08:27:02] hi, I'm processing access requests as part of clinic duty, re: T249059 my understanding is that those folks should be filing tasks for db test hosts instead (?)
[08:27:03] T249059: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059
[08:27:25] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui) s7 eqiad [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1136 [] db1125 [x] db1116 [x] db1101 [x] db1098 [] db1094 [x] db1090 [] db1086 [] db1079
[08:27:32] godog: correct
[08:27:56] marostegui: ack, thank you! I'll update the task
[08:28:00] godog: I already told them https://phabricator.wikimedia.org/T249059#6042209
[08:29:33] ah yeah!
[08:37:37] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui)
[08:44:04] 10DBA, 10Operations, 10Traffic, 10serviceops: Audit and harmonize timeouts across the stack - https://phabricator.wikimedia.org/T250251 (10fgiunchedi) p:05Triage→03Medium
[08:56:17] addshore: kormat just increased db1114's weight a bit, so we are slowly giving it more traffic while monitoring it
[09:05:42] I stumbled upon your screen by accident
[09:05:49] "WARNING: Original size is 569546740275 but transferred size is 1670766106368 for copy to db1114.eqiad.wmnet" is normal
[09:05:56] I know :)
[09:06:01] because I cannot calculate properly the size of the tar :-(
[09:06:02] Can you actually kill that screen? I am done with it
[09:06:15] I can do it, should I?
[09:06:35] yeah, kill it
[09:07:03] you have another called compression FYI
[09:07:09] but seems to be ongoing
[09:07:24] sorry to look into your private things 0:-)
[09:07:28] yep, that is on-going
[09:07:31] cool
[09:07:43] I did -r and thought the transfer one was mine
[09:14:29] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` backup2002.codfw.wmnet ` The log can be found in...
[09:14:31] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['backup2002.codfw.wmnet'] `
[09:15:58] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['backup2002.codfw.wmnet'] ` The log can be foun...
[09:16:05] sorry for the spam
[09:36:18] 10DBA, 10Operations: move db1114 to s8 - https://phabricator.wikimedia.org/T250224 (10Marostegui) db1114 added to the Wikidata CPU dashboard
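(Editor's note: the "Original size ... but transferred size ..." warning discussed above is, per the conversation, down to the sender not being able to work out the size of the tar stream in advance. Below is a minimal, generic sketch of how one could estimate an uncompressed tar's size from the files on disk, 512-byte headers and padding included; the function names and the target path are illustrative assumptions, and this is not how the WMF transfer script actually computes it.)

```python
import os
import tarfile

BLOCK = 512  # tar writes everything in 512-byte blocks


def estimate_tar_size(path):
    """Rough estimate of an uncompressed tar stream for `path`.

    Assumes one 512-byte header per entry plus file payloads padded to a
    512-byte boundary, and the two zero blocks that end an archive.
    Long file names (PAX headers), sparse files and hard links will make
    the real archive differ from this figure.
    """
    total = 0
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            full = os.path.join(root, name)
            total += BLOCK  # entry header
            if os.path.isfile(full) and not os.path.islink(full):
                size = os.path.getsize(full)
                total += ((size + BLOCK - 1) // BLOCK) * BLOCK  # padded payload
    return total + 2 * BLOCK  # end-of-archive marker


def actual_tar_size(path, out="/tmp/estimate-check.tar"):
    """Build the archive for real and measure it (only sensible for small test trees)."""
    with tarfile.open(out, "w") as tar:
        tar.add(path)
    return os.path.getsize(out)


if __name__ == "__main__":
    target = "/srv/sqldata"  # hypothetical datadir, not the path used in the transfer above
    print("estimated tar size:", estimate_tar_size(target))
```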
[09:37:13] how's wikidata these days, more loaded than ever?
[09:37:31] yesterday we had to tackle weights during the night
[09:37:41] oh
[09:37:44] as db1092 is out of rotation it affected the rest of the hosts, which is worrying
[09:37:50] I see
[09:37:57] that's why I decided to place db1114 today asap
[09:39:33] did db1114 have hw issues?
[09:40:35] it crashed once, but ok since then, I guess https://phabricator.wikimedia.org/T214720
[09:40:46] Whoo
[09:40:56] jynus: yeah, and we replaced the mainboard
[09:41:05] ah, I didn't remember that
[09:41:13] yep
[09:41:27] I guess the general increased page view rate of all wikis probably has some impact on all of these load problems
[09:43:13] were there deployments yesterday?
[09:44:41] yes, but the CPU peaking has been observed since Monday
[09:44:44] there was, but I don't see a big correlation: https://grafana.wikimedia.org/d/000000273/mysql?panelId=16&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1087&var-port=9104&from=1586857479102&to=1586943879102
[09:44:46] so not related to that deployment
[09:44:48] yeah
[09:45:08] it is clearly the reduction of capacity from db1092 being depooled
[09:45:20] which means that we cannot survive a host going down
[09:45:33] With db1114 that changes
[09:45:38] hence the urgency
[09:46:28] btw, at some point we have to decide when to delete wikidatawiki.wb_items_per_site_old
[09:46:30] addshore: ^
[09:46:51] We should probably create a task for it, so we don't forget
[09:47:31] We could delete that now imo, but I'd let Amir1 also have a say
[09:47:43] yeah, there is no rush
[09:47:50] just let's make sure we don't forget
[09:48:12] yeah, let's delete it, everything is normal now
[09:48:39] we have backups in any case
[09:50:39] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10jcrespo) Manual install went through- not ideal, but enough to unblock this task. Now we only need to setup the storage- I can take care of this, as the...
[09:57:00] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2002.codfw.wmnet'] ` and were **ALL** successful.
[09:57:07] yay
[10:57:42] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui)
[11:00:07] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui) s8 eqiad [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1005 [x] db1126 [] db1124 [x] db1116 [x] db1114 [x] db1111 [] db1109 [x] db1104 [x] db1101 [x] db1099 [] db10...
[11:06:20] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui)
[11:06:34] 10DBA: type_acton index in logging table is lingering in production - https://phabricator.wikimedia.org/T250057 (10Marostegui) 05Open→03Resolved This is all fixed
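(Editor's note: earlier this morning, around 09:46-09:48, the team agrees that wikidatawiki.wb_items_per_site_old can eventually be deleted, relying on existing backups. Below is a hedged sketch of the common "rename first, drop later" pattern one might use to retire such a table; the host, credentials and holding name are placeholders, and this is not necessarily the procedure the team actually followed.)

```python
import pymysql

# Connection details are placeholders, not production values.
conn = pymysql.connect(host="db1109.example", user="root",
                       password="...", database="wikidatawiki")

with conn.cursor() as cur:
    # Cheap sanity check: the optimizer's row estimate for the table.
    cur.execute(
        "SELECT table_rows FROM information_schema.tables "
        "WHERE table_schema = 'wikidatawiki' "
        "AND table_name = 'wb_items_per_site_old'")
    print("estimated rows:", cur.fetchone()[0])

    # Step 1: rename instead of dropping, so anything still reading the
    # table fails fast while the data stays trivially recoverable.
    cur.execute(
        "RENAME TABLE wb_items_per_site_old TO wb_items_per_site_old_to_drop")

    # Step 2, days or weeks later, once nothing has complained and a fresh
    # backup exists: actually drop the renamed table.
    # cur.execute("DROP TABLE wb_items_per_site_old_to_drop")

conn.close()
```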
[12:41:47] asking the important questions: https://twitter.com/jynus/status/1250403790227980295
[12:42:07] \G of course
[12:42:24] jynus: it's missing an option
[12:42:30] \g ?
[12:42:33] both!
[12:42:38] no crazies in my polls!
[12:42:55] sure, there is a thing called "comments" ;-)
[12:43:18] it requires having an account on something called a "social network"...
[12:43:21] :-P
[12:43:37] marostegui: I see you as a fellow gentleman!
[12:44:03] marostegui: https://jynus.com/gif/cheers.gifv
[12:46:57] * marostegui hugs jynus (with mask and gloves)
[12:47:28] social distance, 1 alligator away/2 baguettes!
[12:47:29] jynus: https://jynus.com/gif/%3C3.gifv
[12:49:12] * kormat notes that jynus.com has some quality content
[12:49:32] kormat: I have a list of all the ones he's released for now
[12:49:39] "for now"
[12:49:43] haha
[12:49:55] Yes, and I need to complain again that since new year's there have been no new ones
[12:50:00] And it is almost summer
[12:50:14] kormat: best one is the one I did for April Fools': https://jynus.com/gif/april_fools.gifv
[12:50:49] https://jynus.com/gif/clap.gifv
[12:51:10] ಠ_ಠ
[12:51:59] let me try that again: ಠ_ಠ,
[12:52:08] hahaha
[12:52:45] i have a feeling i'll need to keep that around for handy pasting
[12:53:03] there. it's now the first item in my list of notes about this place.
[12:55:23] that, and making sure your emoji IRC support is up to standards! 🔝
[13:22:38] I need someone to bounce some storage decisions off, if anyone is around
[13:24:26] 10DBA, 10Operations: move db1114 to s8 - https://phabricator.wikimedia.org/T250224 (10Kormat) 05Open→03Resolved db1114 is now fully pooled with weight 300. db1104 was also reduced from 350 to 300 to match.
[13:25:26] 10DBA, 10Operations: move db1114 to s8 - https://phabricator.wikimedia.org/T250224 (10Marostegui) Thank you so much for the help! This should help s8 indeed. I expect also to repool db1092 tomorrow once the ALTERs are done. Later today I will also decrease the weight for the RC slaves, which we increased yeste...
[13:26:53] addshore Amir1: we are done with db1114, fully pooled in s8: https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?orgId=1 so we've got one extra host :)
[13:27:25] marostegui: do you have 5 minutes?
[13:27:31] yep
[13:27:40] nice, thanks!
[13:27:40] I am in front of the RAID utility
[13:27:47] * marostegui runs
[13:27:50] I have 10 disks
[13:28:03] of 7451.5 GB
[13:28:19] and have to balance reliability and performance (in all senses)
[13:29:11] wait, 8GB?
[13:29:18] didn't we buy 16GB ones?
[13:29:29] 16T you mean?
[13:29:40] yes, sorry, 16KG
[13:29:41] Or what?
[13:30:01] Is that for the backup disk shelf?
[13:30:04] yes
[13:30:19] initially I had thought of having 2 virtual disks
[13:30:25] on RAID6
[13:30:27] we bought 8TB disks as far as I remember
[13:30:37] let me check
[13:30:47] in any case, you see my proposal
[13:31:05] lots of disk space on the same partition looks like a bad idea
[13:31:09] but on the other hand
[13:31:28] it is a bit of a waste and leaves us with little room after setting it up
[13:32:30] you mean 2 VD with 5 disks each on RAID6?
[13:32:33] indeed, it was 8TB disks
[13:32:39] yes
[13:32:44] but forget it
[13:32:51] because I thought we had 16TB ones
[13:33:00] that is not as much
[13:33:13] I will go with a single partition so we have 2 redundant disks only
[13:33:15] yeah, also doing 2 VD you'll be wasting lots of TB indeed
[13:33:28] I thought we had double the space
[13:34:09] 10 8TB disks on raid6 should be around 60TB or something, no?
[13:34:14] probably less
[13:34:25] And 5 must be around 20, so that's a lot of wasted TB
[13:34:27] they are 7600 GB or so
[13:34:31] yeah
[13:34:35] so something like that
[13:34:37] 50 TB usable
[13:34:42] (I am doing the math from my mind)
[13:34:47] which is how we have the other shelf
[13:34:53] this is half of what I had in mind
[13:35:03] right
[13:35:09] So better 1 VD, yeah
[13:35:15] enough for our needs, but we may have to buy another shelf next year
[13:35:43] for content we need 12 + 12 + 12 + X TB
[13:36:26] 26TB now + growth
[13:36:32] yeah
[13:36:45] we can probably assume another 12x2 TB (es4 and es5) as a final target
[13:36:45] sorry, I meant 36
[13:36:48] but of course not this year
[13:37:33] for metadata we need 1.5+3 TB each week
[13:38:19] 54 TB for 3 month retention
[13:38:45] I need to tune down retention
[13:38:58] how many days is it now?
[13:39:03] 90
[13:39:10] nice
[13:39:37] but we may need to make 2 different pools (3 in total)
[13:39:57] 1 for logical backups, 1 for snapshots, with lower retention, and 1 for logical content
[13:40:25] idk, this needs more thought
[13:40:29] that was all, really
[13:40:31] for now
[13:40:33] :)
[13:41:26] oh sait
[13:41:28] wait
[13:41:40] we have 12 disks, there was a second page on the raid utility
[13:41:46] oh XDD
[13:42:20] so let's re-do the math
[13:42:22] that's 90 TB raw
[13:42:37] so it is more than the previous shelves
[13:42:42] just not double
[13:42:45] so raid 6 over 7.6 is 76TB usable
[13:42:56] yep, give or take
[13:43:02] and over 6 disks it is 30 or so
[13:43:07] That's almost 16TB wasted
[13:43:08] that's too little
[13:43:13] "wasted"
[13:43:29] I would still go for a single virtual disk?
[13:43:34] yes, +1
[13:43:46] we can still add a second shelf
[13:43:59] but that gives me back the space I counted with :-D
[13:44:05] haha
[13:44:19] ok for me to go and visit the supermarket?
[13:44:26] or do you want to check more things?
[13:44:28] thanks, you are my favourite rubber duck!
[13:44:33] :-DDD
[13:44:39] haha
[13:44:51] everything is fixed after talking to you
[13:45:00] even just by realizing my own stupidity
[13:45:14] XDDDDDDDDDDDDDDDDDDD
[14:12:45] df -h ~> /dev/mapper/array1-content 73T 24K 73T 1% /srv/content
[14:28:06] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10jcrespo)
[14:31:59] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10jcrespo) I've setup the array now: ` df -h /dev/mapper/array1-content 73T 24K 73T 1% /srv/content ` This is a summary of the setup, I will doc...
[15:06:28] 10DBA, 10Data-Services, 10Projects-Cleanup: Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 (10Bstorm)
[15:08:48] 10DBA, 10Operations, 10ops-codfw: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10jcrespo) 05Open→03Resolved
[15:12:23] 10DBA, 10Operations, 10Goal: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) Codfw hw now available: T248934 (75TB total), only needs puppetization.
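(Editor's note: the capacity and retention figures thrown around in the RAID discussion above can be reproduced with a few lines. The numbers below are only the ones quoted in the chat, rounded: 12 disks reported as 7451.5 GB each, RAID6, and 1.5+3 TB of metadata backups per week at 90 days retention. A minimal sketch of the arithmetic, nothing more.)

```python
# Per-disk size as reported by the RAID controller, converted to TiB.
DISK_TIB = 7451.5 / 1024      # ~7.3 TiB, i.e. an "8TB" drive
N_DISKS = 12                  # the second page of the RAID utility revealed 12, not 10
PARITY = 2                    # RAID6 spends two disks' worth of space on parity

single_vd = (N_DISKS - PARITY) * DISK_TIB          # one big RAID6 virtual disk
two_vds = 2 * (N_DISKS // 2 - PARITY) * DISK_TIB   # two 6-disk RAID6 virtual disks

print(f"single RAID6 VD: {single_vd:.1f} TiB usable")   # ~72.8, matching the 73T df output
print(f"two RAID6 VDs:   {two_vds:.1f} TiB usable")     # ~58.2, i.e. ~14.6 TiB more lost to parity

# Metadata backups: 1.5 TB + 3 TB per week, kept for 90 days.
weekly_tb = 1.5 + 3
print(f"90-day retention needs ~{weekly_tb * 90 / 7:.0f} TB")  # ~58 TB (the chat's 54 TB assumes 12 weeks)
```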
[18:35:30] 10DBA, 10Data-Services, 10Operations, 10cloud-services-team (Kanban): Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (10JHedden) a:03JHedden
[18:46:57] 10DBA, 10Data-Services, 10Operations, 10cloud-services-team (Kanban): Prepare and check storage layer for gr.wikimedia.org - https://phabricator.wikimedia.org/T245912 (10JHedden) 05Open→03Resolved `name="Updates on labsdb10{09,10,11,12}" $ sudo /usr/local/sbin/maintain-replica-indexes --database grwiki...
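(Editor's note: the last task above is closed by running maintain-replica-indexes for grwiki on the labsdb10{09,10,11,12} replicas. Below is a hedged sketch of one generic way to double-check afterwards that the new wiki's database and its indexes are visible on each replica; the hosts, credentials and queries are illustrative assumptions and not part of the WMF tooling.)

```python
import pymysql

# Placeholder hosts and credentials, for illustration only.
REPLICAS = ["labsdb1009", "labsdb1010", "labsdb1011", "labsdb1012"]
WIKI_DB = "grwiki"

for host in REPLICAS:
    conn = pymysql.connect(host=host, user="check", password="...",
                           database="information_schema")
    with conn.cursor() as cur:
        # Does the wiki database exist on this replica at all?
        cur.execute("SELECT COUNT(*) FROM schemata WHERE schema_name = %s",
                    (WIKI_DB,))
        present = cur.fetchone()[0] == 1

        # How many distinct indexes does it carry? Zero right after
        # maintain-replica-indexes ran would be suspicious.
        cur.execute("SELECT COUNT(DISTINCT table_name, index_name) "
                    "FROM statistics WHERE table_schema = %s", (WIKI_DB,))
        n_indexes = cur.fetchone()[0]
    conn.close()
    print(f"{host}: database present={present}, indexes={n_indexes}")
```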