sending objects/files over websockets
Updates
1/24/23: Ran into the cloudpickle library and it reminded me of this post.
3/18/21: Found outrun, which seems to be a better and more thought-out way to do a lot of this.
Introduction
Recently I became interested in better workflows for machine learning projects. One of the main issues I’ve run into is developing locally but not being able to easily test locally, usually because my local machine is slower and older than the training server, so early code decisions that seemed fine remotely perform poorly on it with the models and datasets I work with. Another case is when the remote server is behind a VPN and a relay hop, which adds noticeable latency between saving a file and running it in something like VSCode (although a ProxyJump entry in your ssh config can help). Regardless of the cause, I have yet to find a workflow that is both versatile and simple.
One of my recent ideas was using websockets to push a Python object to a remote server and then executing it there. I’m sure anyone reading this will understand that this is “unsafe” in many regards. Still, I think being able to quickly (with a single command or hotkey) run a model while iterating on it, without having to commit it first, is incredibly useful (committing before every run being one of the other workflows I see frequently).
Implementation
To implement this idea, you will need the dill package. The approach works by serializing the class/function with dill, sending the bytes over TCP, and deserializing and running them on the remote.
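Before wiring in the network, it helps to see the serialize/deserialize round trip on its own. dill exposes the same `dumps`/`loads` interface as the standard-library pickle module (which it extends to cover things pickle can’t, like lambdas and interactively defined classes), so here is a minimal sketch using stdlib pickle; the `Job` class is a made-up stand-in for a model or experiment:

```python
import pickle


class Job:
    """Toy payload standing in for a model/experiment class."""

    def __init__(self, val):
        self.val = val

    def run(self):
        return self.val * 2


# Serialize the instance to bytes, then restore it -- the same round
# trip the client/server pair below performs over a TCP connection.
payload = pickle.dumps(Job(21))
restored = pickle.loads(payload)
print(restored.run())  # -> 42
```

The client/server code below does exactly this, except the bytes travel over a socket and `loads` happens in a different process.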
Here is the basic example for the client:
```python
import asyncio
import os

import dill

server_url = os.environ.get("SERVER_URL", "127.0.0.1")
server_port = int(os.environ.get("SERVER_PORT", 8888))


class Foo:
    def __init__(self, val=10):
        self.val = val

    def run(self):
        print(f"running Foo.run() with val={self.val}")


async def tcp_client():
    obj = Foo(val=1)
    reader, writer = await asyncio.open_connection(server_url, server_port)
    # recurse=True tells dill to also serialize objects the class refers to
    writer.write(dill.dumps(obj, recurse=True))
    writer.close()
    await writer.wait_closed()


if __name__ == "__main__":
    asyncio.run(tcp_client())
```
and then the server:
```python
import asyncio
import os

import dill

server_url = os.environ.get("SERVER_URL", "127.0.0.1")
server_port = int(os.environ.get("SERVER_PORT", 8888))


async def handle_func(reader, writer):
    # read(-1) reads until the client closes the connection
    data = await reader.read(-1)
    obj = dill.loads(data)
    addr = writer.get_extra_info("peername")
    print(f"got obj: {obj} - addr: {addr}")
    obj.run()
    writer.close()
    await writer.wait_closed()


async def main():
    server = await asyncio.start_server(handle_func, server_url, server_port)
    addr = server.sockets[0].getsockname()
    print(f"Serving on {addr}")
    async with server:
        await server.serve_forever()


asyncio.run(main())
```
Most of this is similar to the example in the official Python docs. While this works, it doesn’t hold up once your model/project grows beyond one file, as serializing the related files and loading them on the remote was not something I was able to figure out. I’m still very interested in this, but after looking at marshal, Pyro5, dill, etc., I was not able to get it to work correctly if, for instance, you have a run.py and a model.py.
After messing around with this and having issues getting dill to work with a Python class defined outside the file being run, I wanted to see if it was feasible with another method.
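The multi-file failure makes more sense once you see what actually ends up in the byte stream: by default, pickle (and dill in by-reference mode) serializes a class as a *reference* to its defining module and name, not as the class body itself, so the receiving side must be able to import that module. A small stdlib-pickle sketch makes this visible; `Model` is a throwaway class for illustration:

```python
import pickle


class Model:
    pass


data = pickle.dumps(Model())
# The stream names the defining module and the class, rather than
# embedding the class definition -- so a remote process without
# model.py on its path has nothing to resolve the reference against.
print(Model.__module__.encode() in data, b"Model" in data)
```

This is why an object from `model.py` unpickles fine locally but fails on a remote that only received the bytes, and it motivates shipping the whole folder instead.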
Sending a folder
Instead, the alternative solution for a bigger project involves creating a tar of the project and sending the tar over the socket (at this point websockets are not so important; their only advantage over something like HTTP might be piping results/inputs back and forth).
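One caveat if you do want to pipe results back and forth: the examples here delimit a message by closing the connection, which means one message per connection. Keeping the connection open requires some framing so the receiver knows where a message ends. A common approach (my addition, not part of the original scripts) is a 4-byte length prefix:

```python
import struct


def frame(payload: bytes) -> bytes:
    # Prefix the payload with its length as a 4-byte big-endian int,
    # so the receiver knows exactly how many bytes belong to this
    # message and the connection can stay open for a reply.
    return struct.pack(">I", len(payload)) + payload


def unframe(buf: bytes) -> tuple:
    # Split one framed message off the front of a buffer, returning
    # (message, remaining bytes).
    (length,) = struct.unpack(">I", buf[:4])
    return buf[4:4 + length], buf[4 + length:]


first, rest = unframe(frame(b"hello") + frame(b"world"))
print(first, unframe(rest)[0])
```

With asyncio streams, the receiving side would read the 4-byte header and then `await reader.readexactly(length)` instead of reading to EOF.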
This time the client looks like so:
```python
import asyncio
import os
import tarfile

server_url = os.environ.get("SERVER_URL", "127.0.0.1")
server_port = int(os.environ.get("SERVER_PORT", 8888))


async def tcp_send_folder(tar_path):
    with open(tar_path, "rb") as f:
        data = f.read()
    reader, writer = await asyncio.open_connection(server_url, server_port)
    writer.write(data)
    writer.close()
    await writer.wait_closed()


def create_tarfile(folder):
    out = "out/client/send.tar.gz"
    os.makedirs(os.path.dirname(out), exist_ok=True)
    with tarfile.open(out, mode="w:gz") as tar:
        tar.add(folder)
    return out


if __name__ == "__main__":
    asyncio.run(tcp_send_folder(create_tarfile("src")))
```
and the server as such:
```python
import asyncio
import os
import tarfile

server_url = os.environ.get("SERVER_URL", "127.0.0.1")
server_port = int(os.environ.get("SERVER_PORT", 8888))


def run_folder(tar_path):
    # only extract archives from sources you trust
    with tarfile.open(tar_path) as tar:
        tar.extractall(path="out/run")
    from out.run.src import main

    main.run()


async def handle_folder(reader, writer):
    addr = writer.get_extra_info("peername")
    print(f"receiving from {addr}")
    outfile = "out/server/out.tar.gz"
    os.makedirs(os.path.dirname(outfile), exist_ok=True)
    with open(outfile, "wb") as f:
        while True:
            data = await reader.read(1024)
            if not data:
                break
            f.write(data)
    writer.close()
    await writer.wait_closed()
    run_folder(outfile)
    print("done...")


async def main():
    server = await asyncio.start_server(handle_folder, server_url, server_port)
    addr = server.sockets[0].getsockname()
    print(f"Serving on {addr}")
    async with server:
        await server.serve_forever()


asyncio.run(main())
```
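The `from out.run.src import main` line only works if `out/run/src` resolves as an importable package from where the server runs. A more flexible alternative (my sketch, not the original code) is to load the extracted entrypoint directly by file path with importlib, which avoids `__init__.py` and `sys.path` requirements; `load_module_from_path` is a hypothetical helper name:

```python
import importlib.util
import os
import tempfile


def load_module_from_path(path, name="remote_entry"):
    # Build a module object straight from a file path, so the extracted
    # code needs no package structure or sys.path entry.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway module file; in the server you would point
# this at the extracted entrypoint, e.g. "out/run/src/main.py".
with tempfile.TemporaryDirectory() as d:
    entry = os.path.join(d, "main.py")
    with open(entry, "w") as f:
        f.write("def run():\n    return 'ran'\n")
    mod = load_module_from_path(entry)
    print(mod.run())  # -> ran
```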
Other
It’s hard to say how useful this is. I think there’s potential for it in ML/DL research and experimentation, but it largely depends on a user’s workflow. There are other projects such as DVC’s CML, and maybe other alternatives, but they rely on making a git commit and pushing to a repo to run the experiments. There are probably other ways to do this, such as creating an executable with pyinstaller/cx_Freeze (slow) or making the core functionality serializable (not ideal), but I have yet to find a _best_ way to do it.