Operations SIG 18 Nov 2024
Latest revision as of 12:51, 24 December 2024
Agenda
What: Meeting to Discuss Improving Node Operations as part of the HC1: OP SIG
When: 19 Nov 2024 @ 8 CET EST
Where: https://matrix.to/#/#sig-op:hc1.chat
Chair: 0x3639
Agenda:
- Discuss follow-up items from previous meeting
- Document action items
- Establish next meeting
If you want to attend please respond (or DM) with your full matrix username and I will invite you to the group. No FUD, anger or BS allowed.
Pre-meeting Notes
- Created a troubleshooting script that runs a series of checks to help troubleshoot go-zenon. It runs basic Linux commands to check the service, disk space, and UFW, then inspects the logs and queries some node endpoints.
- Created a bootstrap / restore script that stops go-zenon, backs up and compresses the necessary files, and then restarts go-zenon.
- I've been testing locally and need to submit a PR.
- Traveling this week so probably can't attend the meeting
- I have the znnd_exporter (Prometheus metrics) code ready. Working on the dashboard and getting it auto-installed.
- I want to make sure we start planning for the HyperQube Network Launch Ops support work
- Created a branch in which the database references are explicitly released. Commit message for context:
This commit introduces explicit releasing of database handles. The LevelDB package relies on the Go GC to cleanup unused snapshot references, but many other database packages require snapshots to be released explicitly. These changes serve as a starting point for assessing the usage of alternative databases.
- Releasing the DB references manually provides no apparent improvement in performance, and possibly even has a negative effect. More testing would be needed to determine the effect.
- Overall, manually managing the references is very tedious (and complicated inside the account pool). As the number of changes in the branch shows, it is not a trivial change and it affects a vast portion of the codebase.
- Based on personal testing and anecdotal evidence from others, the recommended approach for syncing a node from scratch on a VPS with non-dedicated resources should be to first sync the node on a local machine and then transfer the node's database to the server.
- Syncing a node locally on my machine takes only around 13 hours, while on a VPS with shared resources it can take over a week. This suggests that LevelDB is not the main culprit for the slow sync, which calls into question how much time should be spent on investigating a LevelDB replacement right now.
Meeting Minutes Summary (chatGPT)
Meeting Minutes Full
Tue, Nov 19, 2024, 12:59:59 - deeznnutz: === START OP SIG 19 NOV 2024 ===
Tue, Nov 19, 2024, 13:00:00 - deeznnutz: GM
Tue, Nov 19, 2024, 13:00:22 - vilkris: Gm
Tue, Nov 19, 2024, 13:00:32 - coinselor: hihi
Tue, Nov 19, 2024, 13:00:41 - deeznnutz: I think we can move quickly today. thx for pushing it one day
Tue, Nov 19, 2024, 13:00:58 - deeznnutz: vilkris: I went through your update. Is there anything you wanted to add.
Tue, Nov 19, 2024, 13:01:05 - deeznnutz: I have a proposed test
Tue, Nov 19, 2024, 13:01:36 - vilkris: Not much to add, but I do think it's good if others can replicate the results
Tue, Nov 19, 2024, 13:01:55 - deeznnutz: Proposed test to confirm Vilkris' results
1) Test 1: sync `go-zenon` directly on this machine - Supermicro SuperServer 5018D-FN8T Xeon D 1U Rackmount, 10GbE, SFP+, 32GB & 512GB M.2
2) Test 2: install proxmox on the same machine, and allocate 100% of the available resources to a single VM and perform the same sync
Tue, Nov 19, 2024, 13:02:02 - deeznnutz: what do you think about this test?
Tue, Nov 19, 2024, 13:02:28 - deeznnutz: with this we can determine if the hypervisor is causing an issue
Tue, Nov 19, 2024, 13:02:57 - vilkris: Sounds good to me. I'm not an expert in that area
Tue, Nov 19, 2024, 13:03:40 - vilkris: It's also easy for anyone to confirm by syncing the node locally on their machine
Tue, Nov 19, 2024, 13:04:16 - vilkris: I was supposed to try the sync on a Mac M1 but didn't get around to that yet
Tue, Nov 19, 2024, 13:04:26 - deeznnutz: I have a lot of crap going on with my mac. I wanted to isolate that and use a clean machine so nothing else is competing for resources
Tue, Nov 19, 2024, 13:04:37 - coinselor: George highlighted the need to not just take sync times at face value but attach each test to comprehensive specs:
"Probably want to give details like:
cloud provider
vm type
specs"
Does the troubleshooting script deeznnutz you made output this information? If not, I can look into adding any missing information to it that might be useful. Maybe some internet speed test?
Tue, Nov 19, 2024, 13:05:14 - deeznnutz: <@coinselor:zenon.chat "George highlighted the need to n..."> it does not. but that is a good idea
Tue, Nov 19, 2024, 13:05:31 - deeznnutz: i did not have inet speed or ping. those could be helpful for sure
Tue, Nov 19, 2024, 13:06:36 - deeznnutz: I'll run this test. On this server. I'm literally setting it up now. At a minimum we can determine if a hypervisor is causing an issue. if not we can start testing more stuff / options.
Tue, Nov 19, 2024, 13:07:58 - vilkris: <@coinselor:zenon.chat "George highlighted the need to n..."> Regarding the different specs on cloud providers we might also want to add a way to time the sync from height 1 to some pre defined height, like 8M
Tue, Nov 19, 2024, 13:08:27 - vilkris: Otherwise it's hard to accurately time the sync
Tue, Nov 19, 2024, 13:08:51 - deeznnutz: is there something you can do to stop the sync at 8m and implement a simple timer?
Tue, Nov 19, 2024, 13:09:47 - vilkris: Without modifying go-zenon I'm not sure
Tue, Nov 19, 2024, 13:10:07 - deeznnutz: we can parse the logs
Tue, Nov 19, 2024, 13:10:25 - vilkris: Yeah that might work
Tue, Nov 19, 2024, 13:10:26 - coinselor: couldn't we hack it to kill the process parsing the grafana logs znnd_exporter george is making?
Tue, Nov 19, 2024, 13:10:28 - deeznnutz: I can do it with a bash script just watching the logs
Tue, Nov 19, 2024, 13:11:05 - deeznnutz: I can also use monit.
Tue, Nov 19, 2024, 13:11:20 - deeznnutz: let me see what I can hack together without touching go-zenon
Tue, Nov 19, 2024, 13:11:41 - vilkris: The logs seem like the easiest approach
Tue, Nov 19, 2024, 13:11:48 - deeznnutz: ya, agree
Tue, Nov 19, 2024, 13:12:26 - vilkris: They are timestamped I think it should record when a momentum is inserted, at least on the debug level
Tue, Nov 19, 2024, 13:13:02 - vilkris: Just need to also record the first momentum's time as well
Tue, Nov 19, 2024, 13:13:20 - deeznnutz: yep. I have an idea how to do that really easily.
Tue, Nov 19, 2024, 13:13:27 - vilkris: Okay nice
Tue, Nov 19, 2024, 13:13:41 - vilkris: That would make it easy to compare results if it's always the same amount of momentums
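A minimal sketch of the log-based timer discussed above, which measures the time from the first logged momentum to a fixed target height without touching go-zenon. The log line format here (an ISO-8601 timestamp followed by a `height=<n>` token) is an assumption made up for illustration, not the confirmed go-zenon format, and `time_to_height` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# ASSUMPTION: each relevant log line starts with an ISO-8601 timestamp and
# contains "height=<n>". This format is invented for illustration; adjust the
# pattern to whatever the node actually writes at the chosen log level.
LINE_RE = re.compile(r"^(?P<ts>\S+)\s.*\bheight=(?P<height>\d+)")


def time_to_height(lines: Iterable[str], target: int) -> Optional[timedelta]:
    """Return the elapsed time between the first logged momentum and the
    first one at or above `target`, or None if the target is never reached."""
    start = None
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a momentum line under the assumed format
        try:
            ts = datetime.fromisoformat(match.group("ts"))
        except ValueError:
            continue  # first token was not a parseable timestamp
        if start is None:
            start = ts  # remember when the first momentum was recorded
        if int(match.group("height")) >= target:
            return ts - start
    return None
```

Because the function accepts any iterable of lines, it can be fed an open log file directly, and using the same target height on every machine keeps results comparable across tests.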
Tue, Nov 19, 2024, 13:14:19 - deeznnutz: Moving on to the troubleshooting flag for our script.
Tue, Nov 19, 2024, 13:14:45 - deeznnutz: The amount of time I spend on go-zenon troubleshooting is pretty high.
Tue, Nov 19, 2024, 13:15:31 - coinselor: We might be also able to parse the logs of the time series data the znnd exporter creates to say figure out how long it took to get to a specific milestone like 1M momentums, but I'm not sure.
Tue, Nov 19, 2024, 13:15:54 - deeznnutz: <@coinselor:zenon.chat "We might be also able to parse t..."> that is the work that george is actually working on
Tue, Nov 19, 2024, 13:16:09 - coinselor: as far as I understand it basically just reformats the output from znnd logs
Tue, Nov 19, 2024, 13:17:28 - deeznnutz: george wrote that code in `go` to track the speed of momentum production. This simple script I can use is just temp until george deploys his code and we can visualize it in grafana.
Tue, Nov 19, 2024, 13:18:20 - deeznnutz: Anything else on this before we move to troubleshooting?
Tue, Nov 19, 2024, 13:18:37 - vilkris: Not from me
Tue, Nov 19, 2024, 13:19:00 - deeznnutz: Cool. So the time required to troubleshoot is very high.
Tue, Nov 19, 2024, 13:19:20 - deeznnutz: for people who do not know how to use Linux, it's very hard to get basic information.
Tue, Nov 19, 2024, 13:19:31 - deeznnutz: Like... is the hard drive out of space.
Tue, Nov 19, 2024, 13:19:54 - deeznnutz: So I will publish this script to help with troubleshooting and coinselor maybe you can review and improve
Tue, Nov 19, 2024, 13:20:15 - deeznnutz: My question is, what is the best way to get the output to me without copy / paste
Tue, Nov 19, 2024, 13:20:28 - deeznnutz: we have severe limits on what some can do
Tue, Nov 19, 2024, 13:20:46 - deeznnutz: I was going to try to push the results to a TG bot
Tue, Nov 19, 2024, 13:20:54 - deeznnutz: but need to expose API keys to do that
Tue, Nov 19, 2024, 13:22:24 - coinselor: Yeah, I'm sure we can think of something to publish the data somewhere. There's no sensitive information right? It's basically specs
Tue, Nov 19, 2024, 13:22:45 - deeznnutz: I think george is very reluctant to ask pillars to expose their IP
Tue, Nov 19, 2024, 13:23:02 - deeznnutz: I could setup an API to receive the data but it will expose the IP
Tue, Nov 19, 2024, 13:23:11 - deeznnutz: but not to me if they push to TG
Tue, Nov 19, 2024, 13:24:20 - deeznnutz: So I was going to encrypt the TG API keys and use some tool to decrypt them when sending the troubleshooting results to TG
Tue, Nov 19, 2024, 13:24:35 - deeznnutz: but I'm open to suggestions
Tue, Nov 19, 2024, 13:24:49 - coinselor: We should look into it. We can copy paste for now. Maybe there's a good solution using nostr/tor or something.
Tue, Nov 19, 2024, 13:25:21 - deeznnutz: ya maybe we can dive in a little more and see what we can come up with.
Tue, Nov 19, 2024, 13:26:25 - vilkris: Don't have any ideas off the top of my head. But I can understand that even copy pasting stuff can be a real pain point
Tue, Nov 19, 2024, 13:26:55 - deeznnutz: <@vilkris:hc1.chat "Don't have any ideas off the top..."> ya, I get a lot of screen shots... which is why I would love a file.
Tue, Nov 19, 2024, 13:27:12 - deeznnutz: Maybe we can take it offline and come up with a solution
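One way to get a file instead of screenshots is to have the script write its findings to a plain-text report that the operator can attach by hand. The sketch below is an illustration only and is not the troubleshooting script described in the notes. The names `collect_report` and `write_report` are hypothetical, and it deliberately records no hostname or IP address, given the concern raised above.

```python
import os
import platform
import shutil
from datetime import datetime, timezone


def collect_report(path: str = "/") -> dict:
    """Gather a few host facts using only the standard library.
    Hostname and IP address are intentionally left out."""
    usage = shutil.disk_usage(path)
    free_pct = round(100 * usage.free / usage.total, 1) if usage.total else 0.0
    return {
        "generated_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "disk_total_gb": round(usage.total / 1e9, 1),
        "disk_free_gb": round(usage.free / 1e9, 1),
        "disk_free_pct": free_pct,
    }


def write_report(report: dict, out_file: str) -> None:
    """Write one 'key: value' pair per line so the file is easy to read and share."""
    with open(out_file, "w", encoding="utf-8") as f:
        for key, value in report.items():
            f.write(f"{key}: {value}\n")
```

A plain `key: value` layout was chosen so that the same file stays readable for a non-technical operator and trivially parseable for whoever receives it. Details such as cloud provider, VM type, ping, or internet speed would need additional tooling and are not covered here.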
Tue, Nov 19, 2024, 13:27:14 - coinselor: <@deeznnutz:zenon.chat "So I will publsh this script to ..."> Sounds good btw. I'll also look into creating a wrapper in go for the scripts as a way for me to get started/familiarized with go. I also wanna make it pretty with that Bubble Tea framework george keeps sharing
Tue, Nov 19, 2024, 13:27:41 - deeznnutz: ya that would be cool. Love that Bubble Tea Framework
Tue, Nov 19, 2024, 13:28:12 - deeznnutz: I have the backup / restore script working well. I just need to submit it for review
Tue, Nov 19, 2024, 13:28:44 - deeznnutz: and george gave an update on his work. He needs to make the dashboard and make the auto-install
Tue, Nov 19, 2024, 13:29:21 - deeznnutz: Finally the HQZ work. I think reusing our scripts for HQZ will be pretty easy
Tue, Nov 19, 2024, 13:30:27 - vilkris: <@deeznnutz:zenon.chat "I have the backup / restore scri..."> I'm assuming this is only for restoring from a local backup?
Tue, Nov 19, 2024, 13:30:32 - deeznnutz: Yes
Tue, Nov 19, 2024, 13:30:52 - deeznnutz: I have a separate one where it pulls from DO, but I'm not going to publish that
Tue, Nov 19, 2024, 13:31:01 - deeznnutz: I back up daily to DO just in case
Tue, Nov 19, 2024, 13:32:48 - vilkris: Okay, just thinking that if the new suggested approach is to sync the node locally then being able to easily transfer the node data onto the server would be useful
Tue, Nov 19, 2024, 13:33:02 - vilkris: Not sure how much work it would be to add that type of functionality to the script
Tue, Nov 19, 2024, 13:33:19 - deeznnutz: actually not that bad
Tue, Nov 19, 2024, 13:33:36 - deeznnutz: it's just a .tar, scp, and untar
Tue, Nov 19, 2024, 13:34:02 - deeznnutz: that's what I do today to send to DO, but with the S3CLI
Tue, Nov 19, 2024, 13:34:21 - coinselor: We could look into offering connectors to large cloud providers but I'm sure there are probably tools already built that we could leverage. And maybe out of scope.
Tue, Nov 19, 2024, 13:34:46 - deeznnutz: S3CLI is a great tool, but you need an S3 endpoint
Tue, Nov 19, 2024, 13:34:49 - vilkris: Gotcha. Well we can think of that more once we've confirmed that local syncing is a solution
Tue, Nov 19, 2024, 13:34:50 - coinselor: Another minor concern would be a user choosing e.g. AWS over others because we have a bootstrap/backup feature on a script or something like that
Tue, Nov 19, 2024, 13:35:52 - deeznnutz: We can add an S3 endpoint for sure. I have that setup today. It just requires an s3 config file.
Tue, Nov 19, 2024, 13:36:20 - deeznnutz: but maybe we start with local backup and restore and move to S3 and/or SCP based on test results?
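The tar, scp, and untar flow mentioned above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual backup / restore script. The function names, the `data_dir` argument, and the remote target string are hypothetical, and which node files need to be included is not specified in these minutes.

```python
import os
import tarfile
from typing import List


def pack(data_dir: str, archive: str) -> None:
    """Create a gzip-compressed tarball of data_dir.
    The node should be stopped before this runs."""
    name = os.path.basename(os.path.normpath(data_dir))
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(data_dir, arcname=name)  # store paths relative to the folder name


def scp_command(archive: str, remote: str) -> List[str]:
    """Build (but do not run) the scp invocation; the result can be
    passed to subprocess.run(..., check=True)."""
    return ["scp", archive, remote]


def unpack(archive: str, dest_dir: str) -> None:
    """Extract the tarball into dest_dir on the receiving machine.
    Only unpack archives you created yourself."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
```

Keeping the transfer step as a separate command builder means the same `pack` and `unpack` functions could later be paired with an S3 upload instead of `scp`, in line with the suggestion to start with local backup and restore first.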
Tue, Nov 19, 2024, 13:37:03 - deeznnutz: Anything else to discuss?
Tue, Nov 19, 2024, 13:37:29 - deeznnutz: Next meeting 16 Dec 24 @ 8PM UTC?
Tue, Nov 19, 2024, 13:38:05 - coinselor: Looks good to me. Nothing to add.
Tue, Nov 19, 2024, 13:38:17 - deeznnutz: Cool - thx guys!!
Tue, Nov 19, 2024, 13:38:29 - vilkris: Thanks!
Tue, Nov 19, 2024, 13:38:39 - deeznnutz: === END OP SIG 19 NOV 2024 ===