Using Kubernetes Jobs for One-Off Ingestion of CSVs
Running Postgres on Kubernetes locally
While setting up Kubernetes locally might seem like overkill for one off data ingestion tasks, it provides several advantages:
- Creates a consistent development environment that mirrors production
- Allows testing of Kubernetes configurations before cloud deployment
- Enables development of microservices in isolation
- Provides a foundation for scaling your ML pipeline
For this tutorial, we’ll use Docker for Mac with its built-in Kubernetes support (v1.9.8). This setup offers a straightforward development experience with modern Kubernetes tooling while maintaining compatibility with cloud deployments. Alternatives like Docker Swarm or Compose have been around for years, but Kubernetes provides a platform for building and managing data pipelines that helps ease the transition from local development to production deployments.
Let’s start by setting up our local environment and required tools.
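Before installing anything, it is worth a quick sanity check that kubectl is actually pointed at the Docker for Mac cluster (the context name below is the one Docker for Mac creates):
# Switch kubectl to the Docker for Mac cluster and confirm it is reachable
kubectl config use-context docker-for-desktop
kubectl get nodes
kubectl cluster-info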
Kubernetes Dashboard
While most cloud providers give you a dashboard out of the box, it's useful to set one up when running a cluster locally, and luckily kubernetes-helm has come a long way. To install Helm, you can simply use
brew install kubernetes-helm
followed by
helm init
and then
helm repo update
While it's also possible to just use the dashboard.yaml config from https://github.com/kubernetes/dashboard, using Helm means we don't have to be bothered by the nuances of the docker-for-desktop setup.
This might seem trivial, but in my experience the fewer Kubernetes config files you have to worry about the better (more so when you are using preconfigured services). Your Kubernetes config files should be looked over and understood by you personally, and a majority of the basic apps you may wish to use will have preconfigured Helm charts, which makes the Kubernetes ecosystem much more straightforward while still allowing a considerable amount of configuration through Helm.
Sidenote: Running kubernetes locally
It is unlikely that the directions in the official docs will work for you as-is when running a local Kubernetes cluster. To fix this, it's necessary to create a tiller service account and a ClusterRoleBinding which grants permissions at the cluster level and in all namespaces:
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller
Including these steps in a Kubernetes config YAML or setup script will enhance clarity and reproducibility for future deployments, even if they are not required for every Kubernetes application.
With that done, use helm to create the dashboard for the cluster: helm install stable/kubernetes-dashboard
View the dashboard
Since Helm allows both simple installs and the use of config files (which we did not use here), the install output will include explicit directions for proxying to the dashboard. For fish it is just:
export POD_NAME=(kubectl get pods -n default -l "app=kubernetes-dashboard,release=interested-marsupial" -o jsonpath="{.items[0].metadata.name}")
kubectl -n default port-forward $POD_NAME 8443:8443
at which point you will be able to reach the dashboard (which most cloud providers would otherwise give you by default) at https://localhost:8443
Side-note: when doing this sort of development, it becomes useful to start writing Kubernetes/Docker helper functions in your shell, since you end up repeating the same incantations constantly. I have a ton I've started using, but an example for this use case in fish shell would be:
function kube_portforward
    # Deployment whose first pod we want to forward to
    set -l DEPLOYMENT_NAME $argv[1]
    # Look up the pod name and its first container port via the app label
    set -l POD_NAME (kubectl get pods -n default -l "app=$DEPLOYMENT_NAME" -o jsonpath="{.items[0].metadata.name}")
    set -l POD_PORT (kubectl get pods -n default -l "app=$DEPLOYMENT_NAME" -o jsonpath="{.items[0].spec.containers[0].ports[0].containerPort}")
    echo "Forwarding to https://localhost:$POD_PORT"
    kubectl -n default port-forward $POD_NAME $POD_PORT
end
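Calling it is then just
kube_portforward kubernetes-dashboard
assuming, as in the selector used above, that the chart labels its pods with app=kubernetes-dashboard.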
Installing Postgres
PostgreSQL serves as our data storage foundation in this Kubernetes setup. While storing data in PostgreSQL might seem unnecessary for smaller datasets that could be processed directly from CSV files, using a database makes sense in our Kubernetes environment. It allows us to create a standardized data pipeline that can handle both initial data ingestion and subsequent model training workflows.
The installation is straightforward using Helm:
helm install -f k8s/helm/postgres_values.yaml stable/postgresql
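To confirm the database actually came up before building on it, you can port-forward to the pod and connect with psql. The label and pod name below are placeholders; the exact values depend on the release name Helm generated and on the chart version, and the credentials are whatever you set in postgres_values.yaml:
# Find the Postgres pod created by the chart (the label may differ between chart versions)
kubectl get pods -n default -l app=postgresql
# Forward Postgres to localhost, substituting the pod name from the listing above
kubectl -n default port-forward <postgresql-pod-name> 5432:5432
# Connect using the credentials defined in postgres_values.yaml
psql -h localhost -p 5432 -U postgres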
Infrastructure Ready for Machine Learning
With PostgreSQL running in our Kubernetes cluster, we now have the basic infrastructure needed for our machine learning pipeline. This setup enables us to:
- Ingest new datasets through Kubernetes jobs
- Train models using transfer learning techniques
- Deploy specialized models based on our general-purpose pretrained model
- Configure and tune models for specific use cases
A later post will cover how to implement these capabilities, starting with a simple example that demonstrates the core workflow.
Data Ingesting
The data itself is quite simple, and will contain something along the lines of these columns:
- 'call_id'
- 'status'
- 'first_name'
- 'id'
- 'content'
- 'filename'
- 'outcome_ids'
- 'occurred_at'
- 'call_type_name'
- 'call_length_sec'
- 'is_customer'
- 'start_time'
- 'end_time'
- 'tag'
- 'prices'
- 'category_label'
- 'rep_group_type'
- 'ranker'
- 'call_start_time'
- 'nchars'
- 'nwords'
Example of data to ingest
An example of what this will look like (only a few rows and columns shown):
first_name | id | content | occurred_at | call_type_name | is_customer | start_time | end_time | tag | rep_group_type | ranker |
---|---|---|---|---|---|---|---|---|---|---|
Henry D… | 7525982 | No. We just bought some more property. And about to be soon renovation on the house […] look like, […] the vacation this year. | 2018-01-10 13:25:38 | subscription | t | 66120 | 79170.0 | outbound call | Middle Reps | 14 |
Henry D… | 7525983 | Okay. I got you. | 2018-01-10 13:25:38 | subscription | f | 72360 | 79170.0 | outbound call | Middle Reps | 15 |
Henry D… | 7525984 | We’re doing for the whole month on the floor and until we got a… | 2018-01-10 13:25:38 | subscription | t | 72360 | 79170.0 | outbound call | Middle Reps | 16 |
Henry D… | 7525985 | Wow. Wow. | 2018-01-10 13:25:38 | subscription | f | 79170 | 83790.0 | outbound call | Middle Reps | 17 |
Henry D… | 7525986 | Yeah. | 2018-01-10 13:25:38 | subscription | t | 79170 | 83790.0 | outbound call | Middle Reps | 18 |
The data in our example comes pre-cleaned, which allows us to focus on the pipeline architecture rather than data preprocessing. In the next section, we’ll explore how to build a complete data ingestion pipeline and integrate it with our machine learning model.
While there are many ways to ingest data into PostgreSQL, we’ll focus on handling CSV files for this example. Our goal is to create a production-ready system that can:
- Load data through Kubernetes jobs
- Trigger model training automatically
- Handle deployment of trained models
For data ingestion, we’ll use pgfutter, a simple but powerful tool that handles CSV imports with minimal configuration:
pgfutter csv dataset.csv
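pgfutter picks up its connection settings from environment variables, which is exactly what makes it easy to drive from a Kubernetes ConfigMap later on. Run by hand against a forwarded database, it would look roughly like this (values are placeholders mirroring the ConfigMap further down):
# Connection settings pgfutter reads from the environment (placeholder values)
export DB_HOST=localhost
export DB_NAME=postgres
export DB_USER=postgres
export DB_PASS=password
# Import the CSV; pgfutter derives the table and columns from the file name and header row
pgfutter csv dataset.csv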
This approach might seem overengineered for small CSV files, but it provides several advantages:
- Standardized data storage across different data types and sources
- Built-in monitoring through the Kubernetes dashboard
- Easy integration with additional services and pipelines
- Scalability from local development to cloud deployment
Why PostgreSQL for CSV Data?
While processing CSV files directly might seem simpler for this machine learning model, using PostgreSQL provides important advantages for building a scalable data pipeline. PostgreSQL serves as a standardized data store that can handle multiple input sources - from CSV files to real-time data streams. For example, we could extend our pipeline to ingest data from phone calls using Google’s Speech-to-Text API, with each transcription automatically flowing into our database through a dedicated Kubernetes job.
For our current CSV ingestion task, we’ll use pgfutter, a lightweight Go tool designed for efficient data imports. pgfutter simplifies the ETL process by handling common data formatting issues and providing a consistent interface for loading data into PostgreSQL. As a compiled Go executable, it’s particularly well-suited for containerized environments and can be easily integrated into Kubernetes jobs.
To implement this in Kubernetes, we’ll create a job definition that combines a ConfigMap for settings and a Job specification for the actual ingestion process. This approach makes our pipeline configurable and reusable, and the same pieces translate naturally into a Helm chart or template in the future.
For example, the ConfigMap would be:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-csv-ingest
  namespace: default
data:
  PGFUTTER_REPO: "github.com/lukasmartinelli/pgfutter"
  DB_HOST: postgres
  DB_NAME: postgres
  DB_USER: postgres
  DB_PASS: password
  CSV_LOCATION: "/data.csv"
  DATA_URL: "https://storage.googleapis.com/neural-niche/raw_text/csv/898f0e56-6a36-491a-9c43-3fed49d6a3d0/calldata.csv"
which lets the job consume these values as environment variables, although things like DATA_URL would typically be overridden when a new ingestion job is kicked off.
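As a sketch of what that override might look like (the URL here is a placeholder, and in a fuller setup this would become a templated Helm value), the existing ConfigMap can simply be patched before the next run:
# Point the next ingestion run at a different CSV (placeholder URL)
kubectl patch configmap config-csv-ingest \
  -p '{"data":{"DATA_URL":"https://example.com/new_dataset.csv"}}'
The Job specification itself is then: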
apiVersion: batch/v1
kind: Job
metadata:
  name: ingest
spec:
  template:
    metadata:
      name: ingest
      labels:
        datatype: csv
    spec:
      containers:
      - name: pgfutter
        image: golang:latest
        envFrom:
        - configMapRef:
            name: config-csv-ingest
        command: ["/bin/sh", "-c"]
        args:
        - go get $(PGFUTTER_REPO);
          wget -O $(CSV_LOCATION) $(DATA_URL);
          pgfutter csv -d "," $(CSV_LOCATION);
      restartPolicy: Never
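To actually run it, apply the two manifests and watch the job complete; the file names below are just whatever you saved the ConfigMap and Job specs as:
# Create the ConfigMap and kick off the ingestion job (placeholder file names)
kubectl apply -f k8s/config-csv-ingest.yaml
kubectl apply -f k8s/job-ingest.yaml
# Watch the job run to completion and inspect the pgfutter output
kubectl get jobs -w
kubectl logs job/ingest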
This example shows how Kubernetes jobs can streamline data ingestion. Though simple, it lays the foundation for automating workflows like model training and deployment, forming a scalable data pipeline.