Operations SIG 23 Dec 2024
Agenda
What: Meeting to Discuss Improving Node Operations as part of the HC1: OP SIG
When: 23 Dec 2024 @ 8 CET EST
Where: https://matrix.to/#/#sig-op:hc1.chat
Chair: 0x3639
Agenda:
- Discuss follow-up items from previous meeting
- Document action items
- Establish next meeting
If you want to attend please respond (or DM) with your full matrix username and I will invite you to the group. No FUD, anger or BS allowed.
Pre-meeting Notes
- Created a deploy script for go-hyperqube. Tested and deployed.
- Created a backup and restore script. Submitted a PR and received feedback for revisions.
- Created a PoC for sync speed testing but it's too manual. Need to improve it.
- Implemented troubleshooting script that saves logs and shares them to a private Telegram channel
- Started work on the Zenon Network docs.
- Working on a simple load testing script usable across hyperqube and mainnet testnets
- Will pick up the sync height exporter again as well
- Started a refactor of the script into more manageable chunks.
- Added a TUI using Gum
- Probably waiting until the script is more stable to finish the refactor.
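A rough sketch of how the sync speed PoC could be made less manual, based on the approach described in the meeting (watch the log, sample server load on each new entry, store it in a local database). The marker string, table layout, and function names are illustrative assumptions, not the actual script, and the 1-minute load average stands in for real CPU and memory sampling.

```python
import os
import sqlite3
import time
from typing import Callable, Iterable


def default_load() -> float:
    """1-minute load average (Unix only); a stand-in for real CPU/memory sampling."""
    return os.getloadavg()[0]


def record_samples(
    lines: Iterable[str],
    conn: sqlite3.Connection,
    marker: str = "momentum",  # hypothetical substring; adjust to the real log format
    load: Callable[[], float] = default_load,
    clock: Callable[[], float] = time.time,
) -> int:
    """Store one row per matching log line and return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (ts REAL, load REAL, line TEXT)"
    )
    written = 0
    for line in lines:
        if marker in line:
            conn.execute(
                "INSERT INTO samples VALUES (?, ?, ?)",
                (clock(), load(), line.rstrip("\n")),
            )
            written += 1
    conn.commit()
    return written


def sync_duration(conn: sqlite3.Connection) -> float:
    """Seconds between the first and last recorded sample (0.0 if the table is empty)."""
    first, last = conn.execute("SELECT MIN(ts), MAX(ts) FROM samples").fetchone()
    if first is None:
        return 0.0
    return last - first
```

Because `lines` is any iterable, the same function could be fed from an open log file or from a small generator that follows the file as it grows. The `load` and `clock` parameters are injectable so the sampler can be tested without a running node.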
Meeting Minutes Summary (ChatGPT)
Updates from Previous Meeting:
1. Completed:
• Troubleshooting script deployed and tested.
• go-hyperqube released with new deployment flags.
2. In Progress:
• Local backup and restore submitted, pending revisions.
• GUI improvements for deploy script under consideration.
Discussions and Actionable Insights:
1. Performance Testing:
• Begin performance testing on hyperqube_z testnet.
• Consider frameworks (e.g., K6) for testing and future automation.
2. Hardware Recommendations:
• Goal to recommend specs for Pillars, emphasizing dedicated CPUs with fewer but high-performance cores.
• Test and document sync speeds on various environments (bare-metal, virtualized, etc.).
3. Local Backup/Restore:
• Shift focus to implement a local backup/restore system first.
• Future plan to enhance it with remote/cloud backup solutions.
4. Testing Framework:
• Need a comprehensive suite for sync speed tests and regression validation.
• Scripts to separate network and verification parts for focused testing.
5. Potential Enhancements:
• Web UI for easier backup/restore operations.
• Improvements to sync processes for nodes using efficient hardware configurations.
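The idea of verifying momentums X through Y for both performance and regression can be sketched as a small harness. This is only an illustration of the shape of such a test: `verify` is a hypothetical placeholder for whatever verification call the team ends up exposing, and the real harness would most likely call the go-zenon functions directly rather than use Python.

```python
import hashlib
import time
from typing import Callable, Tuple


def run_range(
    verify: Callable[[int], bytes], start: int, end: int
) -> Tuple[float, str]:
    """Verify heights start..end inclusive.

    Returns (elapsed seconds, SHA-256 digest over all verification results).
    The elapsed time is the performance signal; the digest is the regression
    signal, since two builds that handle the same range identically should
    produce the same digest.
    """
    digest = hashlib.sha256()
    t0 = time.perf_counter()
    for height in range(start, end + 1):
        digest.update(verify(height))
    elapsed = time.perf_counter() - t0
    return elapsed, digest.hexdigest()
```

Running the same range against two builds and comparing the digests would flag a behavioral change, while comparing the elapsed times on different machines would feed the hardware recommendations. No networking is involved, which keeps the CPU-bound verification part isolated.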
Challenges Identified:
• Sync speeds on shared CPU systems are significantly slower.
• LevelDB’s single-threaded performance is a bottleneck.
• Community members need accessible tools for backup/restore.
Action Items Due
1. Backup/Restore:
• Complete and deploy the local backup/restore feature (assigned: deeznnutz).
• Investigate user-friendly methods to copy backups offsite (assigned: Brat).
2. Performance Testing:
• Continue iterating on scripts to measure sync duration and server load per momentum.
• Plan a future work package for comprehensive testing frameworks (assigned: deeznnutz & Georgez).
3. Hardware Recommendations:
• Begin documenting sync tests with different hardware configurations and environments.
• Investigate and validate dedicated CPU plans for optimized performance.
4. Testing Frameworks:
• Research integration of K6 for load testing hyperqube_z (assigned: Georgez).
• Explore script-based automation to support regression and performance tests.
5. Community Support:
• Provide interim support for community members struggling with node setups, focusing on bootstrap and sync solutions.
6. Miscellaneous:
• Look into SSH-based tools and web UIs for user-friendly node management (assigned: Coinselor).
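The local backup flow described in the meeting (stop the service, copy the files, start the service, then compress the copy) together with a simple rotation step could look roughly like the sketch below. All names are illustrative assumptions, not the submitted script. The `stop` and `start` callables are left to the caller because the actual service commands are not specified here, and labels are assumed to sort chronologically (for example a date stamp).

```python
import shutil
import tarfile
from pathlib import Path
from typing import Callable, List


def backup(
    data_dir: Path,
    backup_root: Path,
    label: str,
    stop: Callable[[], None],
    start: Callable[[], None],
) -> Path:
    """Copy data_dir while the service is stopped, then compress the copy."""
    staging = backup_root / label
    stop()
    try:
        shutil.copytree(data_dir, staging)
    finally:
        start()  # restart the service even if the copy fails
    # Compression happens after the restart, so downtime covers only the copy.
    archive = backup_root / f"{label}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(staging, arcname=label)
    shutil.rmtree(staging)
    return archive


def rotate(backup_root: Path, keep: int) -> List[Path]:
    """Delete all but the newest `keep` archives (by name) and return the deleted paths."""
    archives = sorted(backup_root.glob("*.tar.gz"))
    old = archives[:-keep] if keep > 0 else archives
    for path in old:
        path.unlink()
    return old
```

Compressing after the restart is what keeps the downtime low, as noted in the discussion. For rotation, an existing tool such as the rotate-backups package mentioned in the meeting may be a better starting point than a hand-rolled function; the `rotate` helper above only shows the idea.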
Next Meeting
• Date: 27 Jan 2025, 8 CET (1 EST).
Meeting Minutes Full
Mon, Dec 23, 2024, 12:00:27 - deeznnutz: === START OP SIG 23 DEC 2024 ===
Mon, Dec 23, 2024, 12:01:18 - deeznnutz: GM
Mon, Dec 23, 2024, 12:06:04 - georgezgeorgez: gm
Mon, Dec 23, 2024, 12:06:14 - georgezgeorgez: sorry stepped away for a moment
Mon, Dec 23, 2024, 12:06:23 - deeznnutz: NP. might be just you and me today.
Mon, Dec 23, 2024, 12:06:29 - georgezgeorgez: sounds good
Mon, Dec 23, 2024, 12:06:44 - georgezgeorgez: did you have any agenda notes to post first?
Mon, Dec 23, 2024, 12:06:47 - deeznnutz: yes
Mon, Dec 23, 2024, 12:06:50 - deeznnutz: Major action items & updates from last meeting:
* Deploy troubleshooting script (done & tested). It's in production now.
* Deploy local backup and restore (submitted and revising)
* We released `go-hyperqube` to help deploy the new Layer 2 with the flag `--deploy --hq`
* We saw a cool gui from brat last meeting that could make the deploy script even easier to use.
* Any other updates from last meeting?
Mon, Dec 23, 2024, 12:08:24 - georgezgeorgez: I wasn't at last meeting, but on my end. I want to start performance testing the hyperqube_z testnet. This will be a start for our active performance testing which can be re-used to go-zenon.
Mon, Dec 23, 2024, 12:08:38 - georgezgeorgez: We'll start with some small scripts etc and then eventually formalize it into some framework.
Mon, Dec 23, 2024, 12:09:21 - georgezgeorgez: I want to get back to working on the znnd_exporter to publish NoM specific metrics to prometheus/grafana
Mon, Dec 23, 2024, 12:09:47 - deeznnutz: cool - i had this action item from last meeting. I'll copy / paste here. curious to know what you think
Mon, Dec 23, 2024, 12:09:54 - deeznnutz: * Open action item from last meeting - test sync speeds baremetal vs proxmox with tool to measure performance (pending). The goal was to test if virtualization environments contribute to slow syncs. I need to improve the test process. It requires manual steps to start the python script.  
* My script to test sync speeds monitors the `syslog` and records CPU and mem usage with each new log insertion. It stores this in a local db. This was the easiest way for me to do it because I could use python. Any feedback on this approach? I could also modify george's go script that leverages the go-sdk. I would need to learn a little go.
Mon, Dec 23, 2024, 12:10:43 - deeznnutz: doing this in grafana would be much better
Mon, Dec 23, 2024, 12:11:33 - georgezgeorgez: I think when you are doing any sort of comparative tests, that it is very clear what the system is.
On many cloud vps, there are options for Bare Metal, Dedicated CPU, Shared CPU.
Mon, Dec 23, 2024, 12:12:05 - georgezgeorgez: Bare Metal vs Dedicated CPU with similar specs can maybe capture the performance hit of virtualization.
Mon, Dec 23, 2024, 12:12:15 - georgezgeorgez: But measuring on Shared CPU is tough.
Mon, Dec 23, 2024, 12:12:32 - georgezgeorgez: As I think it also depends on what other things the cloud provider is hosting and how busy they are.
Mon, Dec 23, 2024, 12:13:09 - georgezgeorgez: One thing we should work towards is hardware recommendations for different node types.
Regular znnd node, a pillar, a hyperqube node.
Mon, Dec 23, 2024, 12:13:16 - deeznnutz: As a side note - Toker is using Digital Ocean and the sync speeds are so slow even with 8vCPU and 32g ram it would take 3 weeks to sync
Mon, Dec 23, 2024, 12:13:27 - deeznnutz: I think vilkris saw these speeds too on other VPS.
Mon, Dec 23, 2024, 12:13:36 - georgezgeorgez: For some critical things like a pillar, I don't know if should recommend shared cpu at all.
Mon, Dec 23, 2024, 12:13:51 - georgezgeorgez: Still need to know the machine type.
Mon, Dec 23, 2024, 12:14:01 - georgezgeorgez: Not just the specs. Not all specs are equal.
Mon, Dec 23, 2024, 12:14:01 - deeznnutz: <@georgezgeorgez:hc1.chat "One thing we should work towards..."> yes this would be very helpful. Toker and I spent hours on his node.
Mon, Dec 23, 2024, 12:14:36 - georgezgeorgez: We also have seen that leveldb is single threaded
Mon, Dec 23, 2024, 12:14:44 - georgezgeorgez: or at least, can only make use of 1 core
Mon, Dec 23, 2024, 12:15:04 - georgezgeorgez: So going up to 8vcpu may not really be doing anything
Mon, Dec 23, 2024, 12:15:07 - deeznnutz: I think this should be a goal of this group to recommend specs for a pillar. We can avoid many hours of frustration by starting at the right spec
Mon, Dec 23, 2024, 12:15:19 - georgezgeorgez: it would be better to have one or two really performant cores instead
Mon, Dec 23, 2024, 12:15:45 - deeznnutz: <@georgezgeorgez:hc1.chat "it would be better to have one o..."> good point!
Mon, Dec 23, 2024, 12:16:02 - georgezgeorgez: After this meeting let's dig deeper into your second point. This is what I mean by a framework.
Mon, Dec 23, 2024, 12:16:15 - georgezgeorgez: For now I think we can run scripts and even just manually record the test results in the wiki.
Mon, Dec 23, 2024, 12:16:42 - georgezgeorgez: I've mentioned the K6 framework as well. I think by default it is geared towards web testing, as most things are.
Mon, Dec 23, 2024, 12:16:52 - georgezgeorgez: But I think it is extensible where we could make it fit our use case.
Mon, Dec 23, 2024, 12:17:13 - deeznnutz: This one https://k6.io/
Mon, Dec 23, 2024, 12:17:24 - coinselor: 🫡
Mon, Dec 23, 2024, 12:17:38 - georgezgeorgez: https://docs.digitalocean.com/products/droplets/concepts/choosing-a-plan/#shared-vs-dedicated
Mon, Dec 23, 2024, 12:17:48 - georgezgeorgez: Yeah I would really ask if Toker is using a shared cpu plan or a dedicated cpu plan
Mon, Dec 23, 2024, 12:18:10 - deeznnutz: will do that
Mon, Dec 23, 2024, 12:18:21 - georgezgeorgez: I think for something CPU intensive, not having full access to the cpu, would likely mean way less efficient use of the CPU's caches
Mon, Dec 23, 2024, 12:19:09 - georgezgeorgez: So high level principles, get dedicated cpus if you can, and opt for less but more performant cores.
we'll need to validate that, but I think that makes sense
Mon, Dec 23, 2024, 12:19:38 - deeznnutz: how do we actually test this with different CPUs. I assume just run the sync from various CPUs and measure ram and cpu usage and time to sync?
Mon, Dec 23, 2024, 12:20:49 - georgezgeorgez: So when it comes to syncing, there are two parts. There is the network part. Getting the data.
And then the verification part. Which is likely the CPU intensive part.
Mon, Dec 23, 2024, 12:21:07 - georgezgeorgez: I'd like to separate them out, so we can focus on them individually.
Mon, Dec 23, 2024, 12:21:40 - georgezgeorgez: We should look into how we can setup a test where we verify the momentums X through Y.
Mon, Dec 23, 2024, 12:21:47 - georgezgeorgez: This test will serve multiple purposes.
Mon, Dec 23, 2024, 12:21:55 - georgezgeorgez: One is performance, the other is regression.
Mon, Dec 23, 2024, 12:22:11 - deeznnutz: I assume that will require some customization within `go-zenon` to measure that?
Mon, Dec 23, 2024, 12:22:24 - georgezgeorgez: As we optimize the codebase, we want to verify that the node will handle the same transactions the exact same way.
Mon, Dec 23, 2024, 12:22:51 - georgezgeorgez: I don't think it will be a running node.
But rather call the same functions that go-zenon is calling.
Mon, Dec 23, 2024, 12:24:09 - deeznnutz: I assume that is something you or Vilkris will need to tackle?
Mon, Dec 23, 2024, 12:24:54 - georgezgeorgez: Yeah we need to define the work and get it into a work package.
Mon, Dec 23, 2024, 12:25:04 - georgezgeorgez: But baby steps with a future vision.
Mon, Dec 23, 2024, 12:25:34 - deeznnutz: how are you thinking we can use k6 to test?
Mon, Dec 23, 2024, 12:26:07 - georgezgeorgez: To test hyperqube_z testnet, I will just have a script that generates some random load against a running system.
We'll iterate on it. Switch the random load to be a predefined load. Switch the running system to be a test node without networking, etc.
Mon, Dec 23, 2024, 12:26:13 - coinselor: done catching up - can we tell if a system is a shared cpu/dedicated/bare metal or is that something only the person who deployed it would know?
Mon, Dec 23, 2024, 12:27:01 - georgezgeorgez: Only the person who deployed it would know. Unless there is some metadata leak but that is beyond our scope.
Mon, Dec 23, 2024, 12:27:22 - georgezgeorgez: <@deeznnutz:zenon.chat "how are you thinking we can use ..."> it's a test framework that allows us to define and run tests and then helps generate reports
Mon, Dec 23, 2024, 12:27:24 - coinselor: ok, I was just hoping there would be an easy way to 'bucket' each performance benchmark somehow
Mon, Dec 23, 2024, 12:27:45 - georgezgeorgez: It's built by Grafana so it should integrate nicely as well.
Mon, Dec 23, 2024, 12:28:16 - georgezgeorgez: <@coinselor:zenon.chat "ok, I was just hoping there woul..."> Hmm so right now, the easiest thing to do will be to ask people to post some performance metrics and their hardware type.
Mon, Dec 23, 2024, 12:28:25 - georgezgeorgez: As we mature, we'll have some community members with test labs.
Mon, Dec 23, 2024, 12:28:29 - deeznnutz: is there any point in advancing my performance script. It's really only going to test sync duration and server load per momentum.
Mon, Dec 23, 2024, 12:29:03 - georgezgeorgez: I think it's fine to keep playing with it. I'm sure there will be some learnings that can be applied.
Mon, Dec 23, 2024, 12:29:41 - georgezgeorgez: I guess right now, maybe there is an actual urgent need to help a community member get operational again?
Mon, Dec 23, 2024, 12:30:02 - georgezgeorgez: And from that effort, we will have a hardware recommendation for digital ocean at least.
Mon, Dec 23, 2024, 12:30:05 - deeznnutz: So the action item is for me to advance my script for fun and start working a more robust test suite that will be part of a future work package?
Mon, Dec 23, 2024, 12:30:20 - deeznnutz: <@georgezgeorgez:hc1.chat "I guess right now, maybe there i..."> he is back up and running.
Mon, Dec 23, 2024, 12:30:29 - georgezgeorgez: okay, it's that the sync took forever?
Mon, Dec 23, 2024, 12:30:46 - deeznnutz: well he was on the edge of roping
Mon, Dec 23, 2024, 12:30:53 - deeznnutz: so I had to help him with a bootstrap
Mon, Dec 23, 2024, 12:31:03 - deeznnutz: but he wants to do the right thing and sync
Mon, Dec 23, 2024, 12:31:10 - deeznnutz: and he will do that if I can help him with a solution
Mon, Dec 23, 2024, 12:31:12 - coinselor: supernova pillar candidate right there
Mon, Dec 23, 2024, 12:31:36 - georgezgeorgez: So what would be more useful immediately?
Helping community members create their own backups that they can restore from.
Or figuring out a recommended hardware.
Let's try and focus on fewer things and knock those out.
Mon, Dec 23, 2024, 12:32:08 - deeznnutz: <@georgezgeorgez:hc1.chat "So what would be more useful imm..."> local backup more important RN
Mon, Dec 23, 2024, 12:32:10 - deeznnutz: let
Mon, Dec 23, 2024, 12:32:32 - deeznnutz: I was going to propose removing the backup / restore to DO and just backup / restore local.
Mon, Dec 23, 2024, 12:32:42 - deeznnutz: and then advance to remote backup / restore later.
Mon, Dec 23, 2024, 12:33:04 - georgezgeorgez: okay that's not a bad idea
i think we can teach people how to compress a local backup and get it off their machines
Mon, Dec 23, 2024, 12:33:28 - georgezgeorgez: could shave off like 80% of their restore time
Mon, Dec 23, 2024, 12:33:43 - georgezgeorgez: depending on how frequently they backup
Mon, Dec 23, 2024, 12:33:56 - georgezgeorgez: could work on some scripts auto backup and rotate/clean up
Mon, Dec 23, 2024, 12:34:32 - georgezgeorgez: then once that is in place, another script could take care of automatically pushing it to cloud storage
Mon, Dec 23, 2024, 12:34:36 - deeznnutz: the backup downtime is very low. I stop service, copy files to new location, start service, then compress copied files.
Mon, Dec 23, 2024, 12:35:17 - georgezgeorgez: lol, i'm just mentioning the possibility, not something to try for short term
but pillar elections give pillars an idea of when they can have downtime
Mon, Dec 23, 2024, 12:36:05 - deeznnutz: I can take this on for sure. im 50% of the way there. I think local backup / restore is a valuable feature that would help prevent centralized bootstrapping
Mon, Dec 23, 2024, 12:36:34 - coinselor: scp is not the most user friendly thing, are there any other options ?
Mon, Dec 23, 2024, 12:36:36 - deeznnutz: is there a way to monitor pillar election with the RPC?
Mon, Dec 23, 2024, 12:37:02 - georgezgeorgez: <@coinselor:zenon.chat "scp is not the most user friendl..."> mm, hosting a small webserver to do upload/download
Mon, Dec 23, 2024, 12:37:07 - deeznnutz: <@coinselor:zenon.chat "scp is not the most user friendl..."> not sure. would need to investigate. I cannot think of any. but i've only tried scp
Mon, Dec 23, 2024, 12:37:12 - coinselor: I'm upvoting this as a priority hahaha sounds awesome
Mon, Dec 23, 2024, 12:38:04 - georgezgeorgez: haha it would be pretty cool level of sophistication
pillar sees it does not need to produce a momentum this round, or for another 3 minutes
takes the opportunity for maintenance tasks
Mon, Dec 23, 2024, 12:38:23 - coinselor: i wonder if we can do some trick with ssh + a server. I'll investigate
Mon, Dec 23, 2024, 12:39:16 - georgezgeorgez: lol so a lot of open source projects have the tool as FOSS
but the management software around it depend on vendor licensing
Mon, Dec 23, 2024, 12:39:34 - georgezgeorgez: and that management software will be like a nice web GUI for things like backups etc
Mon, Dec 23, 2024, 12:39:49 - deeznnutz: So I'm going to work on local backup / restore. Brat is going to look at ways to scp / copy off site and we are picking up a group goal to establish a min spec for Pillars.
Mon, Dec 23, 2024, 12:40:47 - georgezgeorgez: Sounds good to me. coinselor I would look into a web UI as it is the most accessible I think
Mon, Dec 23, 2024, 12:41:30 - georgezgeorgez: Just a few buttons.
Backup Now / Download
Upload
Maybe it can evolve into a web based node management tool
Mon, Dec 23, 2024, 12:41:40 - deeznnutz: We also have a more comprehensive testing suite in mind which will be subject to further analysis and work planning. Hopefully we can do this on hyperqube.
Mon, Dec 23, 2024, 12:43:08 - georgezgeorgez: FWIW, there is that recent poll about running a pillar off of syrius
I think from the perspective of hosting the node on user wallet machines, I'm less sure of
But I think there is demand for friendly UX
Mon, Dec 23, 2024, 12:43:32 - coinselor: so lately I've ran into some ssh "apps" when researching charm.sh. You can literally just `ssh git.charm.sh` and you have an interactive terminal app. Another cool one is `ssh terminal.shop`. Less normie friendly, but cooler.
Mon, Dec 23, 2024, 12:43:42 - deeznnutz: ya I saw that. I have not voted. Sounds hard to implement in reality
Mon, Dec 23, 2024, 12:44:18 - georgezgeorgez: <@coinselor:zenon.chat "so lately I've ran into some ssh..."> huh okay. well i'm definitely cool with terminal apps haha. could be a good place to start
Mon, Dec 23, 2024, 12:44:36 - deeznnutz: <@coinselor:zenon.chat "so lately I've ran into some ssh..."> that looks cool
Mon, Dec 23, 2024, 12:44:49 - coinselor: This doesn't sound like the greatest idea to me, but perhaps if you let "syrius" running with just the producer key
Mon, Dec 23, 2024, 12:45:33 - deeznnutz: Did we have anything else on the agenda? that is all I had
Mon, Dec 23, 2024, 12:45:57 - georgezgeorgez: I think we have a pretty good set of action items.
Mon, Dec 23, 2024, 12:46:27 - deeznnutz: 27 Jan 2025 8 CET (1 EST) for the next meeting?
Mon, Dec 23, 2024, 12:47:11 - georgezgeorgez: sounds good
deeznnutz maybe for rotation, you can start with something like https://pypi.org/project/rotate-backups/
Mon, Dec 23, 2024, 12:47:14 - deeznnutz: I wont screw this one up.
Mon, Dec 23, 2024, 12:47:26 - deeznnutz: <@georgezgeorgez:hc1.chat "sounds good"> cool - I'll check that out
Mon, Dec 23, 2024, 12:47:28 - georgezgeorgez: Then in the future we can build more custom things
Mon, Dec 23, 2024, 12:47:41 - georgezgeorgez: just saying see what exists there already
Mon, Dec 23, 2024, 12:47:46 - georgezgeorgez: no idea if that is a good solution
Mon, Dec 23, 2024, 12:47:49 - georgezgeorgez: Good meeting guys!
Mon, Dec 23, 2024, 12:47:59 - deeznnutz: I'll investigate for sure. great meeting boyz!!
Mon, Dec 23, 2024, 12:48:04 - coinselor: happy new year !!
Mon, Dec 23, 2024, 12:48:09 - coinselor: and merry zmas
Mon, Dec 23, 2024, 12:48:10 - deeznnutz: You too!
Mon, Dec 23, 2024, 12:48:35 - georgezgeorgez: yes Happy Holidays, Merry Christmas, and Happy New Year everyone!
Mon, Dec 23, 2024, 12:48:39 - deeznnutz: === END OP SIG 23 DEC 2024 ===
Mon, Dec 23, 2024, 12:52:25 - georgezgeorgez: https://www.reddit.com/r/selfhosted/comments/rpuwzv/how_to_get_a_very_fast_single_core_vps_or_buy/
Mon, Dec 23, 2024, 12:52:48 - georgezgeorgez: but again, there may be future work to get off of leveldb into something that is more multicore friendly
Mon, Dec 23, 2024, 19:37:08 - deeznnutz: <@georgezgeorgez:hc1.chat "but again, there may be future w..."> interesting and makes sense. vilkris is able to sync on a macbook in 12-24 hours because the single thread performs very well.
Mon, Dec 23, 2024, 19:38:03 - deeznnutz: Waiting to hear back from toker on the shared CPU but that has got to be the issue. Or I hope it is. Shared CPU on a shitty CPU that could be throttled.