Using Kubernetes Jobs for one-off ingestion of CSVs
Running Postgres on Kubernetes locally
While this may be overkill, it's better than configuring a Kubernetes cluster on gcloud or wherever else, and if done correctly it will translate to a cloud service we can later use in a production system while allowing us to focus on the microservices individually.
To start Kubernetes locally
I am using Docker for Mac, which comes with Kubernetes v1.9.8 at the time of writing. While it may not perfectly replicate a development/staging/production environment, I find it much more straightforward to develop this way thanks to the newer Kubernetes tooling. We could use Docker Swarm or Compose, but Kubernetes is more versatile for what we're after: not only running an app, but creating pipelines to ingest data and maintaining a stable configuration that could be used in the cloud (it also much more closely resembles how you can develop locally and then roll out to Kubernetes systems).
Kubernetes Dashboard
While most cloud providers set this up for you, it's useful to get the dashboard running since we are managing the cluster ourselves. Luckily kubernetes-helm has come a long way. To install helm, you can simply use brew install kubernetes-helm, followed by helm init, and then helm repo update.
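Assuming Homebrew is available, the whole helm setup boils down to three commands:

brew install kubernetes-helm   # install the helm client
helm init                      # deploy tiller into the cluster
helm repo update               # refresh the stable chart repo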
While it's also possible to just apply the dashboard.yaml config from https://github.com/kubernetes/dashboard, using helm means we don't have to bother with the nuances of the docker-for-desktop system: helm install stable/kubernetes-dashboard
This might seem trivial, but in my experience the fewer Kubernetes config files to worry about the better (more so when you are using preconfigured services). Your Kubernetes config files should be looked over and understood by you personally, and a majority of the basic apps you may wish to use will have preconfigured helm packages, which makes the Kubernetes ecosystem much more straightforward while still allowing a considerable amount of configuration through helm.
Side note for running Kubernetes locally:
For some reason the directions in the official docs may not work for you if you are running a Kubernetes cluster locally. To fix this, we need to create a tiller service account and a ClusterRoleBinding, which grants permissions at the cluster level and in all namespaces.
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller
This will be put into a Kubernetes config yaml for clarity and scripting purposes in the future, but since we are just setting up the Kubernetes dashboard locally, it is not necessary for all Kubernetes deployments.
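Before installing any charts, it's worth confirming that tiller actually came up:

kubectl get pods -n kube-system | grep tiller
helm version   # should report both the client and server (tiller) versions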
With that done, you can simply run helm install stable/kubernetes-dashboard
View the dashboard
Since helm allows both simple installs and the use of config files (which we did not use here), the install will print explicit directions for port-forwarding to the dashboard. For fish it is just:
set -x POD_NAME (kubectl get pods -n default -l "app=kubernetes-dashboard,release=interested-marsupial" -o jsonpath="{.items[0].metadata.name}")
kubectl -n default port-forward $POD_NAME 8443:8443
at which point you will be able to reach the dashboard (which you would otherwise get from a cloud provider by default) at https://localhost:8443
Side note: when building this sort of development setup, it becomes useful to start writing Kubernetes/Docker functions in your shell. I have a ton I've started using, but an example for this use case in fish shell would be:
function kube_portforward
    set -l DEPLOYMENT_NAME $argv[1]
    # grab the first pod matching the app label and its first declared container port
    set -l POD_NAME (kubectl get pods -n default -l "app=$DEPLOYMENT_NAME" -o jsonpath="{.items[0].metadata.name}")
    set -l POD_PORT (kubectl get pods -n default -l "app=$DEPLOYMENT_NAME" -o jsonpath="{.items[0].spec.containers[0].ports[0].containerPort}")
    echo "Forwarding to https://localhost:$POD_PORT"
    kubectl -n default port-forward $POD_NAME $POD_PORT
end
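With the function in place, forwarding the dashboard (or any other single-port deployment) is just a matter of passing the app label, which for the chart above is kubernetes-dashboard:

kube_portforward kubernetes-dashboard
# Forwarding to https://localhost:8443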
Installing Postgres
Postgres is another easy stepping stone once it's up and running. While theoretically the data will be held in its own table, importing the data into Postgres might be unnecessary given the data size; but given the premise of using Kubernetes, it makes sense to ingest the data from a job that starts the data pipeline for a new dataset (i.e. take a pretrained model with some configuration for the new dataset, transfer learn on that dataset, and roll the model out in some deployment capacity). Once again we can use helm to stand up the basic infrastructure of a Postgres DB and create a job to ingest the data into a table.
The basic command will be something along the lines of helm install -f k8s/helm/postgres_values.yaml stable/postgresql
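If you don't have a postgres_values.yaml yet, helm (v2 here) can dump the chart's defaults as a starting point; the file path and release name below are just the ones assumed for this setup:

helm inspect values stable/postgresql > k8s/helm/postgres_values.yaml
# edit the user/password/database/persistence values, then install
helm install -f k8s/helm/postgres_values.yaml --name postgres stable/postgresql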
At this point we have the basic infrastructure set up.
This will allow us to train models using transfer learning and, most importantly, to create deployments that specialize on a generalized dataset while allowing us to simply provide details about how we wish to use the pretrained model and how we wish to tune and serve it. Initially this will be quite simplistic and will be covered in the next post.
Data Ingestion
The data itself is quite simple, and will contain something along the lines of these columns:
- 'call_id'
- 'status'
- 'first_name'
- 'id'
- 'content'
- 'filename'
- 'outcome_ids'
- 'occurred_at'
- 'call_type_name'
- 'call_length_sec'
- 'is_customer'
- 'start_time'
- 'end_time'
- 'tag'
- 'prices'
- 'category_label'
- 'rep_group_type'
- 'ranker'
- 'call_start_time'
- 'nchars'
- 'nwords'
Example of data to ingest (sample rows not included here).
We can look further into these features later, but the data being somewhat cleaned and munged already means we don't have to worry about a lot of trivial wrangling. (In the next section I will cover the ingestion pipeline and the ideas behind incorporating the data into a model that will have some usefulness.)
There are a multitude of ways to ingest the data, but since we are simply using data from a CSV at this point, yet want the capability of the Postgres DB to hold a variety of disparate datasets, we will need to create a job that loads the data and then kicks off the model training and deployment in a way that would be useful for production purposes. The next section will be much more ML oriented, but this section shows how to get the initial cluster running locally in a way that extrapolates to cloud providers, along with being able to ingest and adapt our models to the data in a robust and systematic way.
The job to ingest the data will be a Kubernetes Job (to maintain the pipeline of creating datasets which we can then train on) and is described below; the starting point is simply pgfutter csv dataset.csv
which allows us to ingest the data and create the table without having to worry about columns/tables/etc. While this may be completely unnecessary for smaller datasets that are already CSVs, the ability to hold the dataset in a standardized location, use a database, and have a dashboard to monitor our deployments creates a well-rounded framework onto which we can easily begin to tack peripheral services.
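As a quick sanity check before wiring this into a Job, you can port-forward the Postgres pod and run pgfutter against it directly from your machine; the app=postgresql label and the connection values below are assumptions that should be matched to your helm release, and pgfutter picks its connection settings up from DB_* environment variables:

# find the postgres pod (label depends on your chart/release) and forward it
set -x PG_POD (kubectl get pods -l "app=postgresql" -o jsonpath="{.items[0].metadata.name}")
kubectl port-forward $PG_POD 5432:5432 &

# pgfutter reads its connection settings from the environment
set -x DB_HOST localhost
set -x DB_NAME postgres
set -x DB_USER postgres
set -x DB_PASS password

# by default this creates a text-column table named after the file in an "import" schema
pgfutter csv dataset.csv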
Why are you using Postgres?
The idea is that regardless of how the data comes in, we will have a standardized pipeline to ingest data and create models. While using the CSV directly would be adequate for this machine learning model, perhaps the data will eventually be ingested from another service (i.e. a service that ties into a live phone call and, using some speech-to-text API, has another job transcribe the call and import the data into our DB).
For our task, since we have a CSV, we will be using https://github.com/lukasmartinelli/pgfutter.
It's a super useful tool for this sort of pipeline, as it gives us a streamlined way to ingest the data without worrying about data munging or other steps that can happen later in the pipeline. Also, being a Go executable, it lets us simply create a job that ingests the data and from there cascades it into the ML pipeline, unlike some of the other tools that are frequently recommended.
For this sort of task we will want to use a Kubernetes Job coupled with a ConfigMap that kicks it off, with the idea that it will be much easier to translate into a helm chart or template in the future.
For example, the ConfigMap would be:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-csv-ingest
  namespace: default
data:
  PGFUTTER_REPO: "github.com/lukasmartinelli/pgfutter"
  DB_HOST: postgres
  DB_NAME: postgres
  DB_USER: postgres
  DB_PASS: password
  CSV_LOCATION: "/data.csv"
  DATA_URL: "https://storage.googleapis.com/neural-niche/raw_text/csv/898f0e56-6a36-491a-9c43-3fed49d6a3d0/calldata.csv"
This lets the job consume these values as environment variables, although things like DATA_URL would be passed in when the job is created.
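One way to pass a different DATA_URL per dataset without hand-editing the yaml is to regenerate the ConfigMap from literals whenever a new ingestion is kicked off (the values here simply mirror the ones above):

kubectl create configmap config-csv-ingest \
    --from-literal=PGFUTTER_REPO=github.com/lukasmartinelli/pgfutter \
    --from-literal=DB_HOST=postgres \
    --from-literal=DB_NAME=postgres \
    --from-literal=DB_USER=postgres \
    --from-literal=DB_PASS=password \
    --from-literal=CSV_LOCATION=/data.csv \
    --from-literal=DATA_URL=https://storage.googleapis.com/neural-niche/raw_text/csv/898f0e56-6a36-491a-9c43-3fed49d6a3d0/calldata.csv \
    --dry-run -o yaml | kubectl apply -f -

The Job that consumes this ConfigMap then looks like: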
apiVersion: batch/v1
kind: Job
metadata:
  name: ingest
spec:
  template:
    metadata:
      name: ingest
      labels:
        datatype: csv
    spec:
      containers:
      - name: pgfutter
        image: golang:latest
        envFrom:
        - configMapRef:
            name: config-csv-ingest
        command: ["/bin/sh", "-c"]
        args:
          - go get $(PGFUTTER_REPO);
            wget -O $(CSV_LOCATION) $(DATA_URL);
            pgfutter csv -d "," $(CSV_LOCATION);
      restartPolicy: Never
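Assuming the two manifests above are saved as k8s/configmap-csv-ingest.yaml and k8s/job-ingest.yaml (the file names are just a suggestion), kicking off and watching the ingestion looks like:

kubectl apply -f k8s/configmap-csv-ingest.yaml
kubectl apply -f k8s/job-ingest.yaml

# watch the job run to completion and inspect its output
kubectl get jobs -w
kubectl logs job/ingest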
This gets us the basic idea of how to use Jobs to ingest data into our k8s cluster. While this by itself is trivial and arguably unnecessary, it ties together into a nice pipeline later on by allowing us to kick off jobs and model training.