sending objects/files over websockets
Introduction
Recently I have become interested in better workflows for machine learning projects. One of the main issues I’ve run into is developing locally without being able to test locally, for various reasons. Usually it is because my local machine is a bit too slow and old for the models and datasets I am working on, and I did not architect the code from the start to accommodate a computer slower than the server I will actually train on. It has also happened when the remote server is behind a VPN and a relay hop, so there is noticeable latency between saving a file and running it if I am using something such as VSCode (although a ProxyJump in your ssh config can help). Regardless of the cause, I have yet to find a workflow that is both versatile and simple.
One of my recent ideas was to use websockets to push a Python object to a remote server and then execute it there. I’m sure anyone reading this will understand that it is “unsafe” in many regards. Still, I think being able to quickly (with a single command or hotkey) run a model while you are iterating on it, without having to commit it first, is incredibly useful (committing and pushing is the other frequent workflow I have seen).
Implementation
To implement this idea, you will need the dill package. It works by serializing the class/function, sending the bytes over TCP, and deserializing and running it on the remote.
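dill exposes the same dumps/loads API as the standard pickle module (which it extends to cover objects like lambdas and closures), so the round-trip that happens on either side of the socket can be sketched with pickle alone; the Foo class here is just a stand-in:

```python
import pickle  # dill.dumps / dill.loads are drop-in replacements for these


class Foo:
    def __init__(self, val=10):
        self.val = val

    def run(self):
        return f"running Foo.run() with val={self.val}"


blob = pickle.dumps(Foo(val=1))  # bytes you could write to a socket
restored = pickle.loads(blob)    # works here because Foo is importable
print(restored.run())
```

This succeeds in a single process because the class is importable on the loading side; the whole point of dill's extras (like recurse=True) is to relax that requirement.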
Here is the basic example for the client:
import asyncio
import os

import dill

server_url = os.environ.get("SERVER_URL", "127.0.0.1")
server_port = int(os.environ.get("SERVER_PORT", 8888))


class Foo:
    def __init__(self, val=10):
        self.val = val

    def run(self):
        print(f"running Foo.run() with val={self.val}")


async def tcp_client():
    obj = Foo(val=1)
    # recurse=True makes dill chase down objects the class refers to
    data = dill.dumps(obj, recurse=True)
    reader, writer = await asyncio.open_connection(server_url, server_port)
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


if __name__ == "__main__":
    asyncio.run(tcp_client())
and then the server:
import asyncio
import os

import dill

server_url = os.environ.get("SERVER_URL", "127.0.0.1")
server_port = int(os.environ.get("SERVER_PORT", 8888))


async def handle_func(reader, writer):
    data = await reader.read(-1)  # read until the client closes the connection
    obj = dill.loads(data)
    addr = writer.get_extra_info("peername")
    print(f"got obj: {obj} - addr: {addr}")
    obj.run()
    writer.close()
    await writer.wait_closed()


async def main():
    server = await asyncio.start_server(handle_func, server_url, server_port)
    addr = server.sockets[0].getsockname()
    print(f"Serving on {addr}")
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())
Most of this is similar to the example in the official Python docs. While this works, it won’t work so well once your model/project grows beyond one file, as serializing the related files and loading them on the remote was not something I was able to figure out. I’m still very interested in this, but after looking at marshal, pyro5, dill, etc., I was not able to get it to work correctly if, for instance, you have a run.py and a model.py. After messing around with this and having issues getting dill to work with a Python class defined outside the file being run, I wanted to see if it was feasible with another method.
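Part of the difficulty is that plain pickle serializes functions and classes by reference (module path plus qualified name), so anything not importable on the remote fails; dill can serialize some objects by value, but things imported from a sibling file like model.py are still recorded by reference. The by-reference behavior can be seen with a lambda, whose name cannot be looked up on the receiving side (try_pickle is just a hypothetical helper for this demo):

```python
import pickle


def try_pickle(obj):
    # Plain pickle stores a function by its qualified name, so a lambda
    # (whose name "<lambda>" cannot be resolved on the remote) is refused.
    try:
        pickle.dumps(obj)
        return "pickled"
    except Exception as exc:
        return f"refused: {type(exc).__name__}"


print(try_pickle(len))              # a named, importable function pickles fine
print(try_pickle(lambda x: x + 1))  # the lambda is refused
```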
Sending a folder
Instead, an alternative solution for a bigger project is to create a tar of the project and send the tar over the websocket (although at this point a websocket is not so important; its only advantage over something like HTTP may be piping results/inputs back and forth).
This time the client looks like so:
import asyncio
import os
import tarfile

server_url = os.environ.get("SERVER_URL", "127.0.0.1")
server_port = int(os.environ.get("SERVER_PORT", 8888))


async def tcp_send_file(path):
    with open(path, "rb") as f:
        data = f.read()
    reader, writer = await asyncio.open_connection(server_url, server_port)
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


def create_tarfile(folder):
    f = "out/client/send.tar.gz"
    os.makedirs(os.path.dirname(f), exist_ok=True)
    with tarfile.open(f, mode="w:gz") as tar:
        tar.add(folder)
    return f


if __name__ == "__main__":
    asyncio.run(tcp_send_file(create_tarfile("src")))
and the server as such:
import asyncio
import os
import tarfile

server_url = os.environ.get("SERVER_URL", "127.0.0.1")
server_port = int(os.environ.get("SERVER_PORT", 8888))


def run_folder(archive):
    with tarfile.open(archive) as tar:
        tar.extractall(path="out/run")
    # assumes out/, out/run/, and out/run/src/ are importable as packages
    from out.run.src import main
    main.run()


async def handle_folder(reader, writer):
    addr = writer.get_extra_info("peername")
    print(f"receiving from {addr}")
    outfile = "out/server/out.tar.gz"
    os.makedirs(os.path.dirname(outfile), exist_ok=True)
    with open(outfile, "wb") as f:
        while True:
            data = await reader.read(1024)
            if not data:  # client closed the connection
                break
            f.write(data)
    writer.close()
    await writer.wait_closed()
    run_folder(outfile)
    print("done...")


async def main():
    server = await asyncio.start_server(handle_folder, server_url, server_port)
    addr = server.sockets[0].getsockname()
    print(f"Serving on {addr}")
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())
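Both servers rely on the client closing the connection to signal the end of the payload. If you wanted to keep one connection open and pipe results back and forth (the websocket-style advantage mentioned above), each message needs framing; a minimal sketch uses a length prefix with the standard struct module (frame/unframe are my own names, not part of any library):

```python
import struct


def frame(payload: bytes) -> bytes:
    # Prefix the payload with its length as a 4-byte big-endian integer.
    return struct.pack(">I", len(payload)) + payload


def unframe(buf: bytes) -> tuple:
    # Split one length-prefixed message off the front of buf;
    # returns (message, remaining bytes).
    (length,) = struct.unpack(">I", buf[:4])
    return buf[4:4 + length], buf[4 + length:]


msg, rest = unframe(frame(b"hello") + b"extra")
print(msg, rest)  # b'hello' b'extra'
```

On the asyncio side you would read the 4-byte header with `reader.readexactly(4)` and then `readexactly(length)` for the body, instead of reading until EOF.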
Other
It’s hard to say how useful this is; I think there’s potential for it to be used in ML/DL research/experimentation, but it largely depends on a user’s workflow. There are other projects such as DVC’s CML (and maybe other alternatives), but they rely on making a git commit and pushing to a repo to run the experiments. There are probably other ways to do this, such as creating an executable with pyinstaller/cx_freeze (slow) or making the core functionality serializable (not ideal), but I have yet to find the best way.