Operations SIG 18 Nov 2024


Revision as of 12:15, 24 December 2024

Agenda

What: Meeting to Discuss Improving Node Operations as part of the HC1: OP SIG

When: 19 Nov 2024 @ 8 CET EST

Where: https://matrix.to/#/#sig-op:hc1.chat

Chair: 0x3639

Agenda:

  1. Discuss follow-up items from previous meeting
  2. Document action items
  3. Establish next meeting

If you want to attend please respond (or DM) with your full matrix username and I will invite you to the group. No FUD, anger or BS allowed.

Pre-meeting Notes

0x3639

  • Created a troubleshooting script that runs a series of actions to help troubleshoot go-zenon. It runs basic Linux commands to check the service, disk space, and UFW, then looks at the logs and queries some node endpoints.
  • Created a bootstrap / restore script that stops go-zenon, backs up and compresses the necessary files, and then restarts go-zenon.
  • I've been testing locally and need to submit a PR.
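
The first checks described for the troubleshooting script (service state, disk space, UFW) could be sketched roughly as follows. This is an illustration only, not the script from the notes: the service name `go-zenon`, the `systemctl` and `ufw` invocations, and the 90% warning threshold are all assumptions, and the log and endpoint checks are left out.

```python
import shutil
import subprocess


def disk_report(path="/", warn_percent=90.0):
    """Return disk usage for `path` and flag it when usage crosses the threshold."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    return {
        "path": path,
        "used_percent": round(used_percent, 1),
        "free_gib": round(usage.free / 2**30, 1),
        "warning": used_percent >= warn_percent,
    }


def run_check(command):
    """Run one diagnostic command and capture its output instead of raising on failure."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
        return {
            "command": " ".join(command),
            "exit_code": result.returncode,
            "output": (result.stdout or result.stderr).strip(),
        }
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The tool is missing or hung; record that fact rather than crash.
        return {"command": " ".join(command), "exit_code": None, "output": str(exc)}


def collect_report(service="go-zenon"):
    """Gather the basic facts a helper usually asks for first."""
    return {
        "disk": disk_report("/"),
        # Assumed commands: adjust to the host's init system and firewall.
        "service": run_check(["systemctl", "is-active", service]),
        "firewall": run_check(["ufw", "status"]),
    }
```

Returning plain dictionaries keeps the result easy to dump to a single file, which is simpler to pass along than screenshots or pasted terminal output.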

George

  • Traveling this week so probably can't attend the meeting
  • I have the znnd_exporter (prometheus metrics) code ready. Working on the dashboard and getting it auto-installed
  • I want to make sure we start planning for the HyperQube Network Launch Ops support work

Coinselor

Vilkris

  • Created a branch in which the database references are explicitly released. Commit message for context:
    • This commit introduces explicit releasing of database handles. The LevelDB package relies on the Go GC to cleanup unused snapshot references, but many other database packages require snapshots to be released explicitly. These changes serve as a starting point for assessing the usage of alternative databases.
  • Releasing the DB references manually provides no apparent improvement in performance - possibly a negative effect in performance. Would need more testing to determine the effect.
  • Overall the task of manually managing the references is very tedious (and complicated inside the account pool) and as can be seen from the amount of changes done in the branch, it is not a trivial change and affects a vast portion of the codebase.
  • Based on personal testing and anecdotal evidence from others, the recommended approach for syncing a node from scratch on a VPS with non-dedicated resources should be to first sync the node on a local machine and then transfer the node's database to the server.
  • Syncing a node locally on my machine takes only around 13 hours, while on a VPS with shared resources it can take over a week. This suggests that LevelDB is not the main culprit for the slow sync, which calls into question how much time should be spent on investigating the replacement of LevelDB right now.

Meeting Minutes Summary (ChatGPT)

Summary of Meeting Minutes (OP SIG - 23 Dec 2024)

Updates from Previous Meeting:

1. Completed:

• Troubleshooting script deployed and tested.

• go-hyperqube released with new deployment flags.

2. In Progress:

• Local backup and restore submitted, pending revisions.

• GUI improvements for deploy script under consideration.

Discussions and Actionable Insights:

1. Performance Testing:

• Begin performance testing on hyperqube_z testnet.

• Consider frameworks (e.g., K6) for testing and future automation.

2. Hardware Recommendations:

• Goal to recommend specs for Pillars, emphasizing dedicated CPUs with fewer but high-performance cores.

• Test and document sync speeds on various environments (bare-metal, virtualized, etc.).

3. Local Backup/Restore:

• Shift focus to implement a local backup/restore system first.

• Future plan to enhance it with remote/cloud backup solutions.

4. Testing Framework:

• Need a comprehensive suite for sync speed tests and regression validation.

• Scripts to separate network and verification parts for focused testing.

5. Potential Enhancements:

• Web UI for easier backup/restore operations.

• Improvements to sync processes for nodes using efficient hardware configurations.

Challenges Identified:

• Sync speeds on shared CPU systems are significantly slower.

• LevelDB’s single-threaded performance is a bottleneck.

• Community members need accessible tools for backup/restore.

Action Items Due

1. Backup/Restore:

• Complete and deploy the local backup/restore feature (assigned: deeznnutz).

• Investigate user-friendly methods to copy backups offsite (assigned: Brat).

2. Performance Testing:

• Continue iterating on scripts to measure sync duration and server load per momentum.

• Plan a future work package for comprehensive testing frameworks (assigned: deeznnutz & Georgez).

3. Hardware Recommendations:

• Begin documenting sync tests with different hardware configurations and environments.

• Investigate and validate dedicated CPU plans for optimized performance.

4. Testing Frameworks:

• Research integration of K6 for load testing hyperqube_z (assigned: Georgez).

• Explore script-based automation to support regression and performance tests.

5. Community Support:

• Provide interim support for community members struggling with node setups, focusing on bootstrap and sync solutions.

6. Miscellaneous:

• Look into SSH-based tools and web UIs for user-friendly node management (assigned: Coinselor).

Next Meeting

Date: 27 Jan 2025, 8 CET (1 EST).

Meeting Minutes Full

Tue, Nov 19, 2024, 12:59:59 - deeznnutz: === START OP SIG 19 NOV 2024 ===

Tue, Nov 19, 2024, 13:00:00 - deeznnutz: GM

Tue, Nov 19, 2024, 13:00:22 - vilkris: Gm

Tue, Nov 19, 2024, 13:00:32 - coinselor: hihi

Tue, Nov 19, 2024, 13:00:41 - deeznnutz: I think we can move quickly today.  thx for pushing it one day

Tue, Nov 19, 2024, 13:00:58 - deeznnutz: vilkris: I went through your update.  Is there anything you wanted to add.

Tue, Nov 19, 2024, 13:01:05 - deeznnutz: I have a proposed test

Tue, Nov 19, 2024, 13:01:36 - vilkris: Not much to add, but I do think it's good if others can replicate the results

Tue, Nov 19, 2024, 13:01:55 - deeznnutz: Proposed test to confirm Vilkris' results

1) Test 1: sync `go-zenon` directly on this machine - Supermicro SuperServer 5018D-FN8T Xeon D 1U Rackmount,10GbE,SFP+,32GB & 512GB M.2

2) Test 2: install proxmox on the same machine, and allocate 100% of the available resources to a single VM and perform the same sync

Tue, Nov 19, 2024, 13:02:02 - deeznnutz: what do you think about this test?

Tue, Nov 19, 2024, 13:02:28 - deeznnutz: with this we can determine if the hypervisor is causing an issue

Tue, Nov 19, 2024, 13:02:57 - vilkris: Sounds good to me. I'm not an expert in that area

Tue, Nov 19, 2024, 13:03:40 - vilkris: It's also easy for anyone to confirm by syncing the node locally on their machine

Tue, Nov 19, 2024, 13:04:16 - vilkris: I was supposed to try the sync on a Mac M1 but didn't get around to that yet

Tue, Nov 19, 2024, 13:04:26 - deeznnutz: I have a lot of crap going on with my mac.  I wanted to isolate that and use a clean machine so nothing else is competing for resources

Tue, Nov 19, 2024, 13:04:37 - coinselor: George highlighted the need to not just take sync times at face value but attach each test to comprehensive specs:

"Probably want to give details like:

cloud provider

vm type

specs"

Does the troubleshooting script you made, deeznnutz, output this information? If not, I can look into adding any missing information to it that might be useful. Maybe some internet speed test?


Tue, Nov 19, 2024, 13:05:14 - deeznnutz: <@coinselor:zenon.chat "George highlighted the need to n..."> it does not.  but that is a good idea

Tue, Nov 19, 2024, 13:05:31 - deeznnutz: i did not have inet speed or ping.  those could be helpful for sure
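
A minimal sketch of how each sync test could be tagged with the environment it ran on, along the lines George suggested (cloud provider, VM type, specs). The field names are hypothetical, and the provider and VM type are passed in by the operator because they cannot be detected reliably. Internet speed and ping, which were also mentioned, are not covered here.

```python
import json
import os
import platform
import shutil


def describe_host(provider="unknown", vm_type="unknown"):
    """Build a record of the environment a sync test ran on."""
    disk = shutil.disk_usage(".")
    return {
        "provider": provider,   # supplied by the operator, not detected
        "vm_type": vm_type,     # supplied by the operator, not detected
        "os": platform.system(),
        "kernel": platform.release(),
        "arch": platform.machine(),
        "cpu_count": os.cpu_count(),
        "disk_total_gib": round(disk.total / 2**30, 1),
    }


def to_json(record):
    """Serialize the record so it can be stored next to the measured sync time."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Calling `to_json(describe_host(provider="example-cloud", vm_type="shared-2vcpu"))` yields one JSON document per test run, so results from different machines can be compared side by side.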

Tue, Nov 19, 2024, 13:06:36 - deeznnutz: I'll run this test.  On this server.  I'm literally setting it up now.  At a minimum we can determine if a hypervisor is causing an issue. if not we can start testing more stuff / options.

Tue, Nov 19, 2024, 13:07:58 - vilkris: <@coinselor:zenon.chat "George highlighted the need to n..."> Regarding the different specs on cloud providers we might also want to add a way to time the sync from height 1 to some pre defined height, like 8M

Tue, Nov 19, 2024, 13:08:27 - vilkris: Otherwise it's hard to accurately time the sync

Tue, Nov 19, 2024, 13:08:51 - deeznnutz: is there something you can do to stop  the sync at 8m and implement a simple timer?

Tue, Nov 19, 2024, 13:09:47 - vilkris: Without modifying go-zenon I'm not sure

Tue, Nov 19, 2024, 13:10:07 - deeznnutz: we can parse the logs

Tue, Nov 19, 2024, 13:10:25 - vilkris: Yeah that might work

Tue, Nov 19, 2024, 13:10:26 - coinselor: couldn't we hack it to kill the process parsing the grafana logs znnd_exporter george is making?

Tue, Nov 19, 2024, 13:10:28 - deeznnutz: I can do it with a bash script just watching the logs

Tue, Nov 19, 2024, 13:11:05 - deeznnutz: I can also use monit.

Tue, Nov 19, 2024, 13:11:20 - deeznnutz: let me see what I can hack together without touching go-zenon

Tue, Nov 19, 2024, 13:11:41 - vilkris: The logs seem like the easiest approach

Tue, Nov 19, 2024, 13:11:48 - deeznnutz: ya, agree

Tue, Nov 19, 2024, 13:12:26 - vilkris: They are timestamped I think it should record when a momentum is inserted, at least on the debug level

Tue, Nov 19, 2024, 13:13:02 - vilkris: Just need to also record the first momentum's time as well

Tue, Nov 19, 2024, 13:13:20 - deeznnutz: yep.  I have an idea how to do that really easily.

Tue, Nov 19, 2024, 13:13:27 - vilkris: Okay nice

Tue, Nov 19, 2024, 13:13:41 - vilkris: That would make it easy to compare results if it's always the same amount of momentums
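
The log-parsing idea discussed above could look roughly like this sketch. The log line format used here is hypothetical and is NOT the actual go-zenon output; the regular expression would need to be adapted to whatever the node really writes when a momentum is inserted. The 8M target height is the example figure from the discussion.

```python
import re
from datetime import datetime

# Hypothetical log line format, NOT the actual go-zenon output:
#   2024-11-19T13:00:00 inserted momentum height=12345
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s.*\bheight=(?P<height>\d+)"
)


def sync_duration(lines, start_height=1, target_height=8_000_000):
    """Return the seconds between the first line at or above start_height and
    the first line at or above target_height, or None if the target is never reached."""
    started = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not report a height
        height = int(match.group("height"))
        stamp = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
        if started is None and height >= start_height:
            started = stamp  # record the first momentum's time
        if started is not None and height >= target_height:
            return (stamp - started).total_seconds()
    return None
```

Because the function only reads timestamps, it works on a finished log file as well as on lines streamed from a running node, and it does not require any change to go-zenon itself.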

Tue, Nov 19, 2024, 13:14:19 - deeznnutz: Moving on to the troubleshooting flag for our script.

Tue, Nov 19, 2024, 13:14:45 - deeznnutz: The amount of time I spend on go-zenon troubleshooting is pretty high.

Tue, Nov 19, 2024, 13:15:31 - coinselor: We might be also able to parse the logs of the time series data the znnd exporter creates to say figure out how long it took to get to a specific milestone like 1M momentums, but I'm not sure.

Tue, Nov 19, 2024, 13:15:54 - deeznnutz: <@coinselor:zenon.chat "We might be also able to parse t..."> that is the work that george is actually working on

Tue, Nov 19, 2024, 13:16:09 - coinselor: as far as I understand it basically just reformats the output from znnd logs

Tue, Nov 19, 2024, 13:17:28 - deeznnutz: george wrote that code in `go` to track the speed of momentum production.  This simple script I can use is just temp until george deploys his code and we can visualize it in grafana.  

Tue, Nov 19, 2024, 13:18:20 - deeznnutz: Anything else on this before we move to troubleshooting?

Tue, Nov 19, 2024, 13:18:37 - vilkris: Not from me

Tue, Nov 19, 2024, 13:19:00 - deeznnutz: Cool.  So the time required to troubleshoot is very high.

Tue, Nov 19, 2024, 13:19:20 - deeznnutz: people who do not know how to use linux - it's very hard to get basic information.

Tue, Nov 19, 2024, 13:19:31 - deeznnutz: Like... is the hard drive out of space.

Tue, Nov 19, 2024, 13:19:54 - deeznnutz: So I will publish this script to help with troubleshooting and coinselor maybe you can review and improve

Tue, Nov 19, 2024, 13:20:15 - deeznnutz: My question is, what is the best way to get the output to me without copy / paste

Tue, Nov 19, 2024, 13:20:28 - deeznnutz: we have severe limits on what some can do

Tue, Nov 19, 2024, 13:20:46 - deeznnutz: I was going to try to push the results to a TG bot

Tue, Nov 19, 2024, 13:20:54 - deeznnutz: but need to expose API keys to do that

Tue, Nov 19, 2024, 13:22:24 - coinselor: Yeah, I'm sure we can think of something to publish the data somewhere. There's no sensitive information right? It's basically specs

Tue, Nov 19, 2024, 13:22:45 - deeznnutz: I think george is very reluctant to ask pillars to expose their IP

Tue, Nov 19, 2024, 13:23:02 - deeznnutz: I could setup an API to receive the data but it will expose the IP

Tue, Nov 19, 2024, 13:23:11 - deeznnutz: but not to me if they push to TG

Tue, Nov 19, 2024, 13:24:20 - deeznnutz: So I was going to encrypt the TG API keys and use some tool to decrypt them when sending the troubleshooting results to TG

Tue, Nov 19, 2024, 13:24:35 - deeznnutz: but I'm open to suggestions

Tue, Nov 19, 2024, 13:24:49 - coinselor: We should look into it. We can copy paste for now. Maybe there's a good solution using nostr/tor or something.

Tue, Nov 19, 2024, 13:25:21 - deeznnutz: ya maybe we can dive in a little more and see what we can come up with.  

Tue, Nov 19, 2024, 13:26:25 - vilkris: Don't have any ideas off the top of my head. But I can understand that even copy pasting stuff can be a real pain point

Tue, Nov 19, 2024, 13:26:55 - deeznnutz: <@vilkris:hc1.chat "Don't have any ideas off the top..."> ya, I get a lot of screen shots... which is why I would love a file.  

Tue, Nov 19, 2024, 13:27:12 - deeznnutz: Maybe we can take it offline and come up with a solution

Tue, Nov 19, 2024, 13:27:14 - coinselor: <@deeznnutz:zenon.chat "So I will publsh this script to ..."> Sounds good btw. I'll also look into creating a wrapper in go for the scripts as a way for me to get started/familiarized with go. I also wanna make it pretty with that Bubble Tea framework george keeps sharing

Tue, Nov 19, 2024, 13:27:41 - deeznnutz: ya that would be cool.  Love that Bubble Tea Framework

Tue, Nov 19, 2024, 13:28:12 - deeznnutz: I have the backup / restore script working well.  I just need to submit it for review

Tue, Nov 19, 2024, 13:28:44 - deeznnutz: and george gave an update on his work.  He needs to make the dashboard and make the auto-install

Tue, Nov 19, 2024, 13:29:21 - deeznnutz: Finally the HQZ work.  I think reusing our scripts for HQZ will be pretty easy

Tue, Nov 19, 2024, 13:30:27 - vilkris: <@deeznnutz:zenon.chat "I have the backup / restore scri..."> I'm assuming this is only for restoring from a local backup?

Tue, Nov 19, 2024, 13:30:32 - deeznnutz: Yes

Tue, Nov 19, 2024, 13:30:52 - deeznnutz: I have a separate one where it pulls from DO, but I'm not going to publish that

Tue, Nov 19, 2024, 13:31:01 - deeznnutz: I backup daily to DO just in case

Tue, Nov 19, 2024, 13:32:48 - vilkris: Okay, just thinking that if the new suggested approach is to sync the node locally then being able to easily transfer the node data onto the server would be useful

Tue, Nov 19, 2024, 13:33:02 - vilkris: Not sure how much work it would be to add that type of functionality to the script

Tue, Nov 19, 2024, 13:33:19 - deeznnutz: actually not that bad

Tue, Nov 19, 2024, 13:33:36 - deeznnutz: it's just a .tar, scp, and untar

Tue, Nov 19, 2024, 13:34:02 - deeznnutz: that's what I do today to send to DO, but with the S3CLI
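
The tar, scp, and untar flow mentioned above could be sketched as below. The directory names and the remote target are placeholders, `scp_command` only builds the command rather than running it, and the node is assumed to be stopped before its data directory is archived (as the backup script in the pre-meeting notes does).

```python
import tarfile
from pathlib import Path


def pack(data_dir, archive_path):
    """Compress data_dir into a .tar.gz archive, keeping only the directory name as the root."""
    data_dir = Path(data_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(data_dir, arcname=data_dir.name)
    return Path(archive_path)


def scp_command(archive_path, remote):
    """Build (but do not run) the scp invocation; `remote` looks like 'user@host:/tmp/'."""
    return ["scp", str(archive_path), remote]


def unpack(archive_path, dest_dir):
    """Extract the archive into dest_dir on the receiving machine.

    Only unpack archives you created yourself; tar files from untrusted
    sources can contain unsafe paths.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

The list returned by `scp_command` can be handed to `subprocess.run` on the local machine, after which `unpack` runs on the server against the copied archive.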

Tue, Nov 19, 2024, 13:34:21 - coinselor: We could look into offering connectors to large cloud providers but I'm sure there's probably tools already built we could leverage. And maybe out of scope.

Tue, Nov 19, 2024, 13:34:46 - deeznnutz: S3CLI is a great tool, but you need an S3 endpoint

Tue, Nov 19, 2024, 13:34:49 - vilkris: Gotcha. Well we can think of that more once we've confirmed that local syncing is a solution

Tue, Nov 19, 2024, 13:34:50 - coinselor: Another minor concern would be a user choosing e.g. AWS over others because we have a bootstrap/backup feature on a script or something like that

Tue, Nov 19, 2024, 13:35:52 - deeznnutz: We can add an S3 endpoint for sure.  I have that setup today.  It just requires an s3 config file.  

Tue, Nov 19, 2024, 13:36:20 - deeznnutz: but maybe we start with local backup and restore and move to S3 and/or SCP based on test results?

Tue, Nov 19, 2024, 13:37:03 - deeznnutz: Anything else to discuss?

Tue, Nov 19, 2024, 13:37:29 - deeznnutz: Next meeting 16 Dec 24 @ 8PM UTC?

Tue, Nov 19, 2024, 13:38:05 - coinselor: Looks good to me. Nothing to add.

Tue, Nov 19, 2024, 13:38:17 - deeznnutz: Cool - thx guys!!

Tue, Nov 19, 2024, 13:38:29 - vilkris: Thanks!

Tue, Nov 19, 2024, 13:38:39 - deeznnutz: === END OP SIG 19 NOV 2024 ===