[01:26:40] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul)
```
pc2007
root@pc2007:~# fdisk -l
Disk /dev/sda: 4.4 TiB, 4799217008640 bytes, 9373470720 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector si...
```
[01:27:27] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul)
[01:29:22] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul) a:05Papaul>03Banyek @Banyek all yours
[07:52:29] hey, have a look, if you have time, at the debian package+puppet error on the new parsercaches (but it is ok if you don't have time, no high priority)
[07:54:06] we may need to ping moritz as it could be a debian installer issue that affects other installs- but try to understand what happened there first (it could be a simpler mistake)
[08:05:44] ok
[10:18:18] 10DBA, 10Operations: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10jcrespo) p:05Triage>03High
[10:21:02] 10DBA, 10User-Banyek: Reimage pc2006 with stretch - https://phabricator.wikimedia.org/T207934 (10jcrespo) 05Open>03declined We should work on T208383 instead.
[10:21:04] jynus: hey, can you take a look at https://phabricator.wikimedia.org/T203709 ? Especially s8 on eqiad. It's blocking my work and Manuel is not around for the next couple of weeks
[10:21:14] if the codfw part can be done too, it would be amazing
[10:29:23] 10DBA, 10Operations: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Banyek) a:03Banyek
[10:34:22] 10DBA, 10Operations, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Banyek)
[10:34:24] 10DBA, 10Operations, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10jcrespo) a:05Banyek>03None
[10:36:06] 10DBA, 10Operations, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10jcrespo) a:03Banyek
[11:00:23] Amir1: looking
[11:43:05] jynus: thanks
[11:50:56] Amir1: I am sorry, but we have a blocker on that, which is enabling GTID; that will take me some time to start on
[11:51:36] then this and next week we have limited availability, not only because manuel is out; there are holidays and other reasons
[11:52:02] which means I will be able to work properly on that exactly on the week manuel comes back
[11:52:30] by the time I am synced with understanding what I have to do- those processes are highly manual
[11:52:54] okay :(
[11:53:02] I wish I could help somehow
[11:53:05] so we had several outages
[11:53:11] and we finally solved those
[11:53:31] but sadly now we have to catch up with everything that wasn't done while I was taking care of those
[11:53:55] and I hope you understand that "new features" come next in priority after "fix broken things"
[11:54:38] that doesn't mean they are forgotten, but I have to be realistic about time for that
[11:59:58] jynus: yeah, to be clear, it's not a new feature. It's fixing a very old issue :D
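(On the GTID blocker jynus mentions above: a minimal sketch of MariaDB's documented way to switch a replica to GTID-based replication. The host name is a hypothetical placeholder, and the actual procedure used for production hosts may well differ.)
```
# Sketch only: move a MariaDB replica from binlog coordinates to GTID.
# db1xxx.eqiad.wmnet is a placeholder, not a real host.
mysql -h db1xxx.eqiad.wmnet -e "
  STOP SLAVE;
  CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
  START SLAVE;
"
# Verify: Using_Gtid should now report Slave_Pos
mysql -h db1xxx.eqiad.wmnet -e "SHOW SLAVE STATUS\G" | grep -i gtid
```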
[12:00:38] I know
[12:00:58] but a "new feature" for DBAs == the server is not on fire
[12:01:03] :-/
[12:02:05] haha, true
[12:31:42] yay, no fires!
[12:32:00] i mean, fires are fun and warm and all, but not on databases :P
[12:34:46] addshore: I think I didn't formally recognize the help you gave us during the wikidata issue
[12:35:05] let me do it now, as I was at the time hurrying to fix it
[12:36:15] [= thanks, and thanks for knowing how on earth to fix it in the first place :)
[12:36:24] I felt like everything was going on all at once that day
[12:36:35] well, I made some mistakes, like not updating the page timestamps
[12:36:57] but I now know better than anyone what happens on wikidata edits :-)
[12:37:28] also there may not be a next time, as we are setting up integrity checks right now
[12:37:32] hehe, indeed, well, I'm not sure how many people would have thought of that anyway :D It took me some hours to even think about it
[12:37:43] * addshore is looking forward to integrity checks
[12:37:57] I actually thought about that-
[12:38:16] but then I said- bah, of course it will look at the latest one, not based on the metadata
[12:38:32] but of course it needs that, because on some wikis the latest is not the default
[12:39:46] we need more team cross-over to get better at what we are doing
[13:05:09] how were you thinking of latest, jynus?
[13:05:21] higher rev_id?
[13:05:52] I don't know, I was just wrong :-D
[13:06:22] unlike traditional mediawiki, I don't really know how wikidata stores things
[13:06:31] but that looked like let's order by rev_id
[13:06:52] which then breaks in the odd case that some revisions were imported... :)
[13:09:46] yeah, but it breaks other things too
[13:09:54] users, diffs
[13:11:11] maybe we should double the storage and store diffs in addition to revisions- aside from the obvious usages, we get things like blame becoming possible
[13:11:43] I think diffs *are* stored
[13:11:47] ie. cached
[13:11:55] as they can be regenerated
[13:12:04] cached != stored
[13:12:13] our model works with revisions
[13:12:20] yes
[13:12:22] although sometimes they are compressed
[13:12:29] and actually stored as diffs
[13:12:31] or even in latin1 :)
[13:13:31] tim suggested maybe storing blame trees too in the future, to make blame possible
[13:14:09] I think the problem is having a proper blame algorithm
[13:14:14] I don't know, I was just throwing out some ideas
[13:14:28] some time ago, I tried just using git blame on articles
[13:14:33] but it breaks quite badly
[13:14:39] since articles aren't line-based
[13:14:43] as code often is
[13:15:12] once we have a proper blamer
[13:15:25] that is the least of the issues
[13:15:26] it could run very slowly, or get different caching layers
[13:15:30] now make it efficient
[13:15:51] there are pages with 20000 revisions, or maybe 100K
[13:15:55] what good is a blame tool which doesn't work?
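(Platonides' git-blame experiment is easy to reproduce. A rough sketch using the public MediaWiki revisions API; the article title is taken from the example later in this log, the revision limit is illustrative, and a real history with 20000+ revisions would need API continuation and be very slow, which is exactly the efficiency problem being discussed.)
```
#!/bin/bash
# Sketch: replay part of an article's history into git, one revision per
# commit, then run git blame on it. Assumes git, curl and jq are installed.
TITLE="Escala_de_Kinsey"
API="https://es.wikipedia.org/w/api.php"
mkdir -p blame-test && cd blame-test && git init -q

# oldest 25 revisions, ids + wikitext content
curl -s "$API?action=query&prop=revisions&titles=$TITLE&rvdir=newer&rvlimit=25&rvprop=ids%7Ccontent&rvslots=main&format=json&formatversion=2" > revs.json

n=$(jq '.query.pages[0].revisions | length' revs.json)
for i in $(seq 0 $((n - 1))); do
    jq -r ".query.pages[0].revisions[$i].slots.main.content" revs.json > article.wiki
    revid=$(jq -r ".query.pages[0].revisions[$i].revid" revs.json)
    git add article.wiki
    git commit -q -m "revid $revid"
done

# This is where it breaks down: wikitext is not line-based, so a cosmetic
# edit that rewraps a paragraph "steals" the blame for whole lines.
git blame article.wiki | head
```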
[13:16:16] and they each have to be decompressed and retrieved from disk
[13:16:19] if I really care about who wrote XX, I may be ok with waiting an hour
[13:16:39] if you are ok with waiting 1 hour, that can be cached and stored :-)
[13:17:18] you just need to double the content storage, around 20 TB
[13:18:18] (not taking into account redundancy, the extra caching chain, extra jobqueue resources)
[13:18:25] I think it is doable
[13:18:40] just not easy or cheap
[13:19:02] if it can be regenerated, maybe we don't need to store it so much
[13:19:20] just a big-enough persistent cache could do
[13:19:29] oh, it needs that, even if you want it dynamic like the parsercaches
[13:19:52] so only the most accessed articles get it updated
[13:20:43] not sure what you mean
[13:22:46] we can store just a subset of the articles to "save" space, the same way only a subset of the revisions are parsed into html and stored so our infra doesn't melt down from rendering wikitext
[13:23:14] (saving blame maps)
[13:23:31] I thought that most articles had their html in the parsercache
[13:26:11] well, the good thing about "caches" is that the most used ones get automatically preferred
[13:28:16] Platonides: I see you with enough motivation to take on https://phabricator.wikimedia.org/T2639 :-)
[13:28:29] * Platonides reluctantly opens that link
[13:30:01] a 2004 task... :)
[13:31:02] not so long ago I removed some vandal text from an article: "* It was established, for example, that 60% of men and 33% of women had taken part in at least one overt homosexual act by the age of 16."
[13:31:06] https://es.wikipedia.org/w/index.php?title=Escala_de_Kinsey&diff=prev&oldid=111440482
[13:31:37] I searched through the history to 'blame' it
[13:31:53] it turned out to have been added 10 years ago by an IP :(
[13:31:55] https://es.wikipedia.org/w/index.php?title=Escala_de_Kinsey&diff=19048088
[14:31:25] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Banyek) a:05Banyek>03Papaul @Papaul as I checked the storage on the hosts, it is set up with a stripe size of 512Kb instead of 256K (https://wikitech.wi...
[14:45:05] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10jcrespo) A larger stripe size should not be a huge issue (unlike a smaller one, which affected performance significantly and we didn't like it). We were thi...
[14:47:21] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Banyek) @jcrespo actually I can change the stripe size on one of the hosts and do some comparison, what do you think about this?
[14:48:52] ^ banyek- it requires a lengthy reconfiguration and a reimage, I am a bit worried about that
[14:49:15] we are already a bit late with this, so I am unsure
[14:49:28] certainly it is better to do it now than later
[14:50:14] if you think it won't take you long I am ok with it, but we cannot spend a lot of time on that
[14:51:15] e.g. if you can work on puppet in parallel with the reimage, go on
[14:51:55] ok, but without that, how do I know what the metrics for "performance of the disk is acceptable" are? note: in the past few years I worked with fusionIO drives, except where I had raid0 SSD :(
[14:52:15] hm...
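(A sketch of the comparison banyek proposes, assuming the controllers are managed with MegaCli and that fio is available on the hosts; the directory and job parameters are illustrative only.)
```
# 1. Confirm the configured strip size on each host
#    (MegaCli prints a "Strip Size:" line per logical drive):
megacli -LDInfo -LAll -aAll | grep -i 'strip size'

# 2. Run the same random-write job on the 256K host and the 512K host and
#    compare IOPS/latency; 16k blocks roughly match InnoDB page writes:
mkdir -p /srv/fio-test
fio --name=pc-stripe-test --directory=/srv/fio-test \
    --rw=randwrite --bs=16k --direct=1 --ioengine=libaio \
    --iodepth=16 --numjobs=4 --size=4G \
    --runtime=120 --time_based --group_reporting
```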
[14:52:24] I can compare it with other hosts
[14:52:27] ok, nm
[14:52:29] I'll check
[14:52:55] I really don't have an answer for that, I am just worried about spending too much time on it, you understand?
[14:53:40] with the disk sizes we are getting, it makes sense to increase the stripe size anyway, so up to you :-)
[14:54:46] I trust you to do the right thing no matter what you decide :-)
[15:05:42] actually I would leave it as-is, but you brought this up
[15:05:46] here's my proposal
[15:06:50] no proposal on my side, do whatever you consider adequate as long as you don't spend much time on it :-D
[15:06:54] I do/check/fix_if_needed the puppet part, and if we still have time when I am done with it, I can go for setting the stripe size - once the puppet part is complete there is no need to worry about reimaging, as it is pretty quick
[15:07:09] cool with me
[15:07:45] the only thing you may not know is that we were already thinking of increasing the stripe size
[15:07:55] but never had the time to test it
[15:08:04] and the RAID tool is horrible
[15:09:03] banyek: do you remember which is the x1 host you set up on codfw?
[15:09:07] do you remember the name?
[15:09:16] db2096
[15:09:24] thanks, you saved me some searching
[15:09:30] I am glad!
[15:11:30] banyek: I forgot one last thing to ask you
[15:11:36] about pc* hosts
[15:11:43] go for it
[15:11:44] disable the learning cycle if they have it
[15:11:49] on the RAID controller
[15:12:02] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson)
[15:12:03] so we prevent them from going write-through every 90 days
[15:12:31] we do those manually instead of arbitrarily killing our performance :-)
[15:12:43] ! I wrote it down!
[15:13:14] I will check that on db2096, as I am checking it for unrelated reasons
[15:13:53] good
[15:14:01] I did not touch that part
[15:15:30] actually, I am looking at the documentation
[15:15:34] and it may not be needed
[15:15:50] there is a new mode, "Auto-Learn Mode: Transparent"
[15:15:58] which apparently doesn't impact performance?
[15:16:05] so it may be no longer needed?
[15:16:21] "In PERC H700 and previous, virtual disks automatically switch to Write-Through mode when the battery charge is low because of a learn cycle. Once the battery charge is sufficient, Write-Back mode will be re-enabled."
[15:19:21] I am asking around whether anyone knows about this and whether I am understanding it well
[15:29:26] banyek: ok, so ignore my last comment for now, it may not be needed based on the above, sorry for the ping
[15:30:00] ok
[15:30:20] what do you think about this? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/470851/
[15:30:30] it failed because of stretch, right?
[15:30:31] looking
[15:31:27] one thing: you keep editing line 1954, not sure if intentionally
[15:31:46] it is good after your change, just in case you didn't notice it
[15:32:26] but I would put it in a separate commit if you really want to fix it, to not mix functionality
[15:33:06] banyek: the reason it fails is "wmf-style: total violations delta 1"
[15:33:13] because site.pp should not have parameters
[15:33:34] yes, but as you can see I just copied the pc2004 block
[15:33:43] so that would fail too?
[15:33:49] so the existing ones are wrong- maybe consider refactoring the existing ones first into a hiera parameter?
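(Back on the RAID battery question from earlier: a sketch of how the learn-cycle behaviour banyek wrote down could be inspected, assuming a MegaCli-managed PERC controller; exact flag spellings vary between MegaCli versions, so treat these as illustrative.)
```
# Which auto-learn mode is configured (on newer firmware this can be the
# "Transparent" mode quoted above, which does not drop the write cache):
megacli -AdpBbuCmd -GetBbuProperties -a0

# Battery charge level and whether a learn cycle is currently in progress:
megacli -AdpBbuCmd -GetBbuStatus -a0

# Current cache policy of the logical drives; during a learn cycle on older
# controllers this falls back from WriteBack to WriteThrough:
megacli -LDInfo -LAll -aAll | grep -i 'cache policy'
```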
[15:34:04] as I said, it needs a lot of puppet work :-)
[15:34:42] aside from that, there are more changes that have to happen at the same time - I would add the 3 of them at the same time
[15:34:45] yes, that was the point I wanted to get to- to confirm that it is possible the existing ones violate a rule
[15:34:48] thanks!
[15:34:52] and there is some extra thing
[15:35:13] that has to happen at the same time- prometheus monitoring
[15:35:32] grep pc1004 to see all references to the parsercache hosts on puppet
[15:35:37] to understand what I mean
[15:35:51] I can help with the refactoring- it is not an easy one, ok?
[15:36:06] I may be able to do that on friday- not sure
[15:36:51] if you find time for that I'd appreciate it (but then drop me a message); if not, I'll work on it
[15:37:18] sure, I normally add you as reviewer on every patch
[15:37:30] that will CC you on every change
[15:37:44] great
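(What the "grep pc1004" suggestion above looks like in practice, assuming a local checkout of operations/puppet cloned from gerrit:)
```
# List every file that still references an old parsercache host, so the
# site.pp, hiera and prometheus pieces all get updated together:
git clone https://gerrit.wikimedia.org/r/operations/puppet
cd puppet
grep -rn 'pc1004' --include='*.pp' --include='*.yaml' --include='*.erb' .
```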