Using Kubernetes Jobs for One-Off Ingestion of CSVs
Running Postgres on Kubernetes locally
While setting up Kubernetes locally might seem like overkill for one off data ingestion tasks, it provides several advantages:
- Creates a consistent development environment that mirrors production
- Allows testing of Kubernetes configurations before cloud deployment
- Enables development of microservices in isolation
- Provides a foundation for scaling your ML pipeline
For this tutorial, we’ll use Docker for Mac with its built-in Kubernetes support (v1.9.8). This setup offers a straightforward development experience with modern Kubernetes tooling while maintaining compatibility with cloud deployments. Alternatives like Docker Swarm or Compose have been around for years, but Kubernetes provides a platform for building and managing data pipelines that helps ease the transition from local development to production deployments.
Let’s start by setting up our local environment and required tools.
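Before installing anything, it is worth a quick sanity check that kubectl is actually pointed at the Docker for Mac cluster (the context name below is the one Docker for Mac creates):
# Switch kubectl to the Docker for Mac cluster and confirm it is reachable
kubectl config use-context docker-for-desktop
kubectl get nodes
kubectl cluster-info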
Kubernetes Dashboard
While most cloud providers give you a dashboard out of the box, it's useful to set one up when running a cluster locally, and luckily kubernetes-helm has come a long way. To install Helm, you can simply use
brew install kubernetes-helm
followed by
helm init
and then
helm repo update
While it's also possible to just use the dashboard.yaml config from https://github.com/kubernetes/dashboard, using Helm means we don't have to be bothered by the nuances of the docker-for-desktop setup.
This might seem trivial, but in my experience the fewer Kubernetes config files you have to worry about the better (more so when you are using preconfigured services). Your Kubernetes config files should be looked over and understood by you personally, and a majority of the basic apps you may wish to use will have preconfigured Helm charts, which makes the Kubernetes ecosystem much more straightforward while still allowing a considerable amount of configuration through Helm.
Sidenote: Running kubernetes locally
It is unlikely that the directions in the official docs will work for you as-is when running a local Kubernetes cluster. To fix this, it's necessary to create a tiller service account and a ClusterRoleBinding which grants permissions at the cluster level and in all namespaces:
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller
Including these steps in a Kubernetes config YAML or setup script will enhance clarity and reproducibility for future deployments, even if they are not required for every Kubernetes application.
With that done, use helm to create the dashboard for the cluster: helm install stable/kubernetes-dashboard
View the dashboard
Since Helm allows both simple installs and the use of config files (which we did not use here), the install output will include explicit directions for proxying to the dashboard. For fish it is just:
export POD_NAME=(kubectl get pods -n default -l "app=kubernetes-dashboard,release=interested-marsupial" -o jsonpath="{.items[0].metadata.name}")
kubectl -n default port-forward $POD_NAME 8443:8443
at which point you will be able to reach the dashboard (which most cloud providers would otherwise give you by default) at https://localhost:8443
Side-note: when doing this sort of development, it becomes useful to start writing Kubernetes/Docker helper functions in your shell, since you end up repeating the same incantations constantly. I have a ton I've started using, but an example for this use case in fish shell would be:
function kube_portforward
    # Deployment whose first pod we want to forward to
    set -l DEPLOYMENT_NAME $argv[1]
    # Look up the pod name and its first container port via the app label
    set -l POD_NAME (kubectl get pods -n default -l "app=$DEPLOYMENT_NAME" -o jsonpath="{.items[0].metadata.name}")
    set -l POD_PORT (kubectl get pods -n default -l "app=$DEPLOYMENT_NAME" -o jsonpath="{.items[0].spec.containers[0].ports[0].containerPort}")
    echo "Forwarding to https://localhost:$POD_PORT"
    kubectl -n default port-forward $POD_NAME $POD_PORT
end
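Calling it is then just
kube_portforward kubernetes-dashboard
assuming, as in the selector used above, that the chart labels its pods with app=kubernetes-dashboard.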
Installing Postgres
PostgreSQL serves as our data storage foundation in this Kubernetes setup. While storing data in PostgreSQL might seem unnecessary for smaller datasets that could be processed directly from CSV files, using a database makes sense in our Kubernetes environment. It allows us to create a standardized data pipeline that can handle both initial data ingestion and subsequent model training workflows.
The installation is straightforward using Helm:
helm install -f k8s/helm/postgres_values.yaml stable/postgresql
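To confirm the database actually came up before building on it, you can port-forward to the pod and connect with psql. The label and pod name below are placeholders; the exact values depend on the release name Helm generated and on the chart version, and the credentials are whatever you set in postgres_values.yaml:
# Find the Postgres pod created by the chart (the label may differ between chart versions)
kubectl get pods -n default -l app=postgresql
# Forward Postgres to localhost, substituting the pod name from the listing above
kubectl -n default port-forward <postgresql-pod-name> 5432:5432
# Connect using the credentials defined in postgres_values.yaml
psql -h localhost -p 5432 -U postgres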
Infrastructure Ready for Machine Learning
With PostgreSQL running in our Kubernetes cluster, we now have the basic infrastructure needed for our machine learning pipeline. This setup enables us to:
- Ingest new datasets through Kubernetes jobs
- Train models using transfer learning techniques
- Deploy specialized models based on our general-purpose pretrained model
- Configure and tune models for specific use cases
A later post will cover how to implement these capabilities, starting with a simple example that demonstrates the core workflow.
Data Ingesting
The data itself is quite simple, and will contain something along the lines of these columns:
- 'call_id'
- 'status'
- 'first_name'
- 'id'
- 'content'
- 'filename'
- 'outcome_ids'
- 'occurred_at'
- 'call_type_name'
- 'call_length_sec'
- 'is_customer'
- 'start_time'
- 'end_time'
- 'tag'
- 'prices'
- 'category_label'
- 'rep_group_type'
- 'ranker'
- 'call_start_time'
- 'nchars'
- 'nwords'
Example of data to ingest
An example of what this will look like (only a few rows and columns shown):
first_name | id | content | occurred_at | call_type_name | is_customer | start_time | end_time | tag | rep_group_type | ranker |
---|---|---|---|---|---|---|---|---|---|---|
Henry D… | 7525982 | No. We just bought some more property. And about to be soon renovation on the house […] look like, […] the vacation this year. | 2018-01-10 13:25:38 | subscription | t | 66120 | 79170.0 | outbound call | Middle Reps | 14 |
Henry D… | 7525983 | Okay. I got you. | 2018-01-10 13:25:38 | subscription | f | 72360 | 79170.0 | outbound call | Middle Reps | 15 |
Henry D… | 7525984 | We’re doing for the whole month on the floor and until we got a… | 2018-01-10 13:25:38 | subscription | t | 72360 | 79170.0 | outbound call | Middle Reps | 16 |
Henry D… | 7525985 | Wow. Wow. | 2018-01-10 13:25:38 | subscription | f | 79170 | 83790.0 | outbound call | Middle Reps | 17 |
Henry D… | 7525986 | Yeah. | 2018-01-10 13:25:38 | subscription | t | 79170 | 83790.0 | outbound call | Middle Reps | 18 |
The data in our example comes pre-cleaned, which allows us to focus on the pipeline architecture rather than data preprocessing. In the next section, we’ll explore how to build a complete data ingestion pipeline and integrate it with our machine learning model.
While there are many ways to ingest data into PostgreSQL, we’ll focus on handling CSV files for this example. Our goal is to create a production-ready system that can:
- Load data through Kubernetes jobs
- Trigger model training automatically
- Handle deployment of trained models
For data ingestion, we’ll use pgfutter, a simple but powerful tool that handles CSV imports with minimal configuration:
pgfutter csv dataset.csv
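pgfutter picks up its connection settings from environment variables, which is exactly what makes it easy to drive from a Kubernetes ConfigMap later on. Run by hand against a forwarded database, it would look roughly like this (values are placeholders mirroring the ConfigMap further down):
# Connection settings pgfutter reads from the environment (placeholder values)
export DB_HOST=localhost
export DB_NAME=postgres
export DB_USER=postgres
export DB_PASS=password
# Import the CSV; pgfutter derives the table and columns from the file name and header row
pgfutter csv dataset.csv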
This approach might seem overengineered for small CSV files, but it provides several advantages:
- Standardized data storage across different data types and sources
- Built-in monitoring through the Kubernetes dashboard
- Easy integration with additional services and pipelines
- Scalability from local development to cloud deployment
Why PostgreSQL for CSV Data?
While processing CSV files directly might seem simpler for this machine learning model, using PostgreSQL provides important advantages for building a scalable data pipeline. PostgreSQL serves as a standardized data store that can handle multiple input sources - from CSV files to real-time data streams. For example, we could extend our pipeline to ingest data from phone calls using Google’s Speech-to-Text API, with each transcription automatically flowing into our database through a dedicated Kubernetes job.
For our current CSV ingestion task, we’ll use pgfutter, a lightweight Go tool designed for efficient data imports. pgfutter simplifies the ETL process by handling common data formatting issues and providing a consistent interface for loading data into PostgreSQL. As a compiled Go executable, it’s particularly well-suited for containerized environments and can be easily integrated into Kubernetes jobs.
To implement this in Kubernetes, we’ll create a job definition that combines a ConfigMap for settings and a Job specification for the actual ingestion process. This approach makes our pipeline configurable and reusable, and the same pieces translate naturally into a Helm chart or template in the future.
For example, the ConfigMap would be:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-csv-ingest
  namespace: default
data:
  PGFUTTER_REPO: "github.com/lukasmartinelli/pgfutter"
  DB_HOST: postgres
  DB_NAME: postgres
  DB_USER: postgres
  DB_PASS: password
  CSV_LOCATION: "/data.csv"
  DATA_URL: "https://storage.googleapis.com/neural-niche/raw_text/csv/898f0e56-6a36-491a-9c43-3fed49d6a3d0/calldata.csv"
which lets the job consume these values as environment variables, although things like DATA_URL would typically be overridden when a new ingestion job is kicked off.
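As a sketch of what that override might look like (the URL here is a placeholder, and in a fuller setup this would become a templated Helm value), the existing ConfigMap can simply be patched before the next run:
# Point the next ingestion run at a different CSV (placeholder URL)
kubectl patch configmap config-csv-ingest \
  -p '{"data":{"DATA_URL":"https://example.com/new_dataset.csv"}}'
The Job specification itself is then: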
apiVersion: batch/v1
kind: Job
metadata:
  name: ingest
spec:
  template:
    metadata:
      name: ingest
      labels:
        datatype: csv
    spec:
      containers:
      - name: pgfutter
        image: golang:latest
        envFrom:
        - configMapRef:
            name: config-csv-ingest
        command: ["/bin/sh", "-c"]
        args:
        - go get $(PGFUTTER_REPO);
          wget -O $(CSV_LOCATION) $(DATA_URL);
          pgfutter csv -d "," $(CSV_LOCATION);
      restartPolicy: Never
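To actually run it, apply the two manifests and watch the job complete; the file names below are just whatever you saved the ConfigMap and Job specs as:
# Create the ConfigMap and kick off the ingestion job (placeholder file names)
kubectl apply -f k8s/config-csv-ingest.yaml
kubectl apply -f k8s/job-ingest.yaml
# Watch the job run to completion and inspect the pgfutter output
kubectl get jobs -w
kubectl logs job/ingest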
This example shows how Kubernetes jobs can streamline data ingestion. Though simple, it lays the foundation for automating workflows like model training and deployment, forming a scalable data pipeline.