Simple Stream Processing With I/O Operations

Suppose we need to process a data stream in a sequence of steps (or “operations” or “functions”). If all the steps are CPU-bound, we simply chain them: each data item goes through the functions one by one, and the CPU is fully utilized. Now, if one or more of the steps involve I/O, typical examples being disk I/O or HTTP service calls, things get interesting.
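To make the contrast concrete, here is a minimal sketch, not necessarily the approach the post takes. It assumes a hypothetical CPU-bound step `parse` and a hypothetical I/O-bound step `fetch` (simulated with `time.sleep`), and overlaps the I/O waits with a thread pool while keeping the output in input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def parse(x):
    # Hypothetical CPU-bound step: cheap arithmetic standing in for real work.
    return x * 2


def fetch(x):
    # Hypothetical I/O-bound step: the sleep stands in for a disk read
    # or an HTTP service call, during which the CPU would sit idle.
    time.sleep(0.05)
    return x + 1


def process_stream(items, max_workers=8):
    # Chain the CPU-bound step lazily, item by item.
    parsed = (parse(x) for x in items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch concurrently in worker threads and yields
        # results in the same order as the input.
        yield from pool.map(fetch, parsed)


if __name__ == "__main__":
    print(list(process_stream(range(5))))
```

One caveat of this sketch: `Executor.map` submits every input item up front, so a truly unbounded stream would need some form of bounded buffering instead.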

October 18, 2020

Service Batching from Scratch, Again

In a previous post, I described an approach to serving machine learning models with built-in “batching”. The design has been used in real work with good results. Over time, however, I have observed some pain points and learned some new tricks. Finally, I got a chance to sit down and overhaul it from scratch.

September 27, 2020

Service Batching from Scratch

TensorFlow Serving has “batching” capability because the model inference can be “vectorized”. This means that if you let it predict y for x, versus one hundred ys for one hundred xs, the latter may not be much slower than the former, because the algorithm takes the batch and performs fast matrix computation, or even sends the job to a GPU for very fast parallel computation. The effect is that the batch may take somewhat longer than a single element, but per-item time is much shorter, hence throughput is much higher.
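The throughput argument can be sketched in a few lines. Everything here is an illustrative assumption rather than the design from the post: `predict_batch` is a hypothetical stand-in for a vectorized model, `batches` and `predict_stream` are helper names made up for this example, and the latencies passed to `throughput` are invented numbers.

```python
def predict_batch(xs):
    # Hypothetical stand-in for a vectorized model: one call handles the
    # whole batch. A real model would do matrix computation here.
    return [2 * x + 1 for x in xs]


def batches(items, size):
    # Group a stream of items into lists of at most `size` elements.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def predict_stream(items, batch_size):
    # Call the model once per batch instead of once per item.
    for batch in batches(items, batch_size):
        yield from predict_batch(batch)


def throughput(batch_size, batch_latency_s):
    # Items processed per second when each call handles `batch_size` items.
    return batch_size / batch_latency_s


if __name__ == "__main__":
    print(list(predict_stream(range(5), batch_size=2)))
    # Made-up latencies: 10 ms for one item, 30 ms for a batch of 100.
    print(throughput(1, 0.010), throughput(100, 0.030))
```

With these made-up latencies, the batch takes three times as long as a single item, yet throughput rises from 100 items per second to roughly 3,333.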

April 17, 2020

Biglist for Single-Machine, Out-of-Memory Long Sequence

When dealing with large amounts of data, a common situation is that the data is manageable on a single machine but too unwieldy to load into memory all at once, for example, when the data size on disk is above 10GB and growing.

April 5, 2020

Integrating Hive UDF's in Python

I need to do some pretty flexible things in my Hive queries, so flexible that they are beyond the capability of HiveQL. Writing a Hive UDF (user-defined function) is an option. However, all the online examples I could find require the UDF to be a standalone script, placed at a known location in HDFS, and used via the ADD FILE statement that is understood by the Hive CLI. Having to put the script in HDFS and use it on a machine that has the Hive CLI installed means an interruption to my Python code flow, which I hate. I want the Hive UDF to be seamlessly integrated into my Python code. How can I do that?

October 26, 2019

Speeding up Python with C++ and pybind11

In two previous posts, I described using Cython and C, respectively, to speed up a Python program that involves simple numerical computations. In this post, I will use the “big brother” of the industrial languages, C++, to speed up the same program.

December 15, 2018

Speeding up Python with C and cffi

In an appetizer served back in April 2017, I demonstrated using Cython to speed up numerical bottlenecks in Python code by 100+ times. I have not used Cython in actual projects since then. Instead, I have had some experience using C++ to extend Python. Compared with Cython, which is deeply intertwined with Python, using a separate language to write Python extensions has advantages, such as

December 1, 2018

Understanding Word Embedding by word2vec

I read a few sources explaining “word embedding”, especially as carried out by the word2vec algorithm. I felt that some intuitive clarity at a high level was still to be desired. Here I’ll attempt to describe it in my own words.

November 17, 2018

"Insert Overwrite Into Table" with Amazon Athena

For a long time, Amazon Athena did not support INSERT or CTAS (Create Table As Select) statements. To be sure, the results of a query are automatically saved. But the saved files are always in CSV format, and in obscure locations.

October 14, 2018

Talking to Spark from Python via Livy

In order to use Spark in my self-sufficient Docker containers without worrying about access to a Spark client environment (to use spark-submit, for example), I found the Apache Livy project. Livy provides a REST service for interacting with a Spark cluster.

September 8, 2018

Python, C++, Pybind11, and string_view

The other day I was using the excellent pybind11 to bridge some Python code and C++ code. The C++ code was performance critical, hence I used string_view (standardized in C++17) to avoid copying wherever possible.

January 29, 2018

Overloading `operator==` for a C++ Class Hierarchy

Recently I needed to define the equality operator between objects in a class hierarchy. A quick search revealed some discussions on this topic, and the opinion appears to be that this, while certainly doable, does not have a clean, elegant solution.

January 27, 2018

On the Bias-Variance Tradeoff

Yes, we have all seen the nice decomposition formula

  mean_squared_error = squared_bias + variance + irreducible_error

However, I’ve been puzzled by explanations of the tradeoff based on this formula, because it’s never convincingly clear that the mean_squared_error and irreducible_error terms are fixed as we explore different models!
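A small simulation makes the terms tangible. This is only a sketch under assumptions of my own choosing, not the argument of the post: the “models” are shrunken sample means `shrink * mean(sample)`, the truth is a constant, the noise is Gaussian, and the function name `decompose` and all parameter values are hypothetical.

```python
import random
import statistics


def decompose(shrink, n_trials=2000, n_obs=5, truth=3.0, noise_sd=1.0, seed=0):
    # Each trial draws a fresh noisy sample and fits a (hypothetical) model:
    # the sample mean multiplied by `shrink`.
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        sample = [truth + rng.gauss(0, noise_sd) for _ in range(n_obs)]
        estimates.append(shrink * statistics.fmean(sample))

    squared_bias = (statistics.fmean(estimates) - truth) ** 2
    variance = statistics.pvariance(estimates)
    # Noise variance of a fresh observation; known here by construction.
    irreducible_error = noise_sd ** 2
    # Expected squared error against a fresh noisy observation: the
    # empirical estimation error plus the (analytic) noise variance.
    mean_squared_error = (
        statistics.fmean((e - truth) ** 2 for e in estimates) + irreducible_error
    )
    return squared_bias, variance, irreducible_error, mean_squared_error


if __name__ == "__main__":
    for shrink in (1.0, 0.5):
        print(shrink, decompose(shrink))
```

In this toy setup, moving from `shrink=1.0` to `shrink=0.5` lowers the variance and raises the squared bias, while the irreducible error stays put and the mean squared error itself changes from one model to the other.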

October 10, 2017

Embedding Python in C++, Part 3

In this post I will explore passing STL containers to Python in a variety of ways. The focus is to sort out how to pass containers by value and by reference.

February 12, 2017

Embedding Python in C++, Part 1

Python has very good inter-operability with C and C++. (In this post I’ll just say C++ for simplicity.) There are two sides to this “inter-op”.

February 10, 2017

Should we use Scala or Python?

Data team,

With the prospect of starting to use Spark seriously, people are saying “(it’s time to) learn Scala”. I’m very worried about the data team becoming split between languages, as our platform team is.

August 9, 2016

Why I Prefer Python to R for Data Work

I first encountered S-PLUS (a commercial distribution of S—R’s parent) in 2000, and used it for statistics coursework for a few years. From 2004 through 2011, I used R daily and intensively for implementing my research on statistical methodologies.

April 20, 2016