Suppose we need to process a data stream in a sequence of steps (or “operations” or “functions”). If all the steps are CPU-bound, then we just chain them together. Each data item goes through the CPU-bound functions one by one, and the CPU is fully utilized. Now, if one or more of the steps involve I/O, typical examples being disk I/O or HTTP service calls, things get interesting.
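To make the contrast concrete, here is a minimal sketch, assuming asyncio as the concurrency tool (the post excerpt itself does not say which approach is used). The names `cpu_step` and `io_step` are hypothetical, and the I/O call is simulated with a short sleep rather than a real disk or HTTP operation.

```python
import asyncio
import time


def cpu_step(x: int) -> int:
    # A CPU-bound step: pure computation, no waiting.
    return x * 2


async def io_step(x: int) -> int:
    # Stand-in for an I/O-bound step (disk read, HTTP call, ...).
    # The sleep simulates time spent waiting, during which the CPU is idle.
    await asyncio.sleep(0.05)
    return x + 1


async def process_sequentially(items):
    # Each item waits for the previous item's I/O to finish.
    out = []
    for x in items:
        out.append(await io_step(cpu_step(x)))
    return out


async def process_concurrently(items):
    # All I/O waits overlap; gather preserves the input order.
    return await asyncio.gather(*(io_step(cpu_step(x)) for x in items))


def timed(coro_fn, items):
    # Run one of the coroutines above and measure wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro_fn(items))
    return list(result), time.perf_counter() - start


items = list(range(10))
seq_result, seq_seconds = timed(process_sequentially, items)
con_result, con_seconds = timed(process_concurrently, items)
print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

Both versions return the same results; the concurrent one simply stops the I/O waits from adding up item by item.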
In a previous post, I described an approach to serving machine learning models with built-in “batching”. The design has been used in real work with good results. Over time, however, I have observed some pain points and learned some new tricks. Finally, I got a chance to sit down and overhaul it from scratch.
I have Docker container A running a server, and container B running a client. At least for testing, both containers run on the same machine (host). The client software needs to reach out of its own container and then into the server container.
TensorFlow Serving has “batching” capability because the model inference can be “vectorized”.
This means that if you let it predict y for one x, versus one hundred ys for one hundred xs, the latter may not be much slower than the former,
because the algorithm takes the batch and performs fast matrix computation, or even sends the job to GPU for very fast parallel computation.
The effect is that the batch may take somewhat longer than a single element,
but per-item time is much shorter, hence throughput is much higher.
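The arithmetic behind this claim can be sketched as follows. The latency figures are hypothetical numbers chosen for illustration, not measurements from TensorFlow Serving.

```python
# Hypothetical latencies, for illustration only.
single_latency_ms = 10.0   # time to predict for one x
batch_latency_ms = 30.0    # time to predict for a batch of 100 xs
batch_size = 100

# Per-item time when items are processed as a batch.
per_item_ms_batched = batch_latency_ms / batch_size

# Throughput in items per second for each mode.
throughput_single = 1000.0 / single_latency_ms
throughput_batched = batch_size * 1000.0 / batch_latency_ms

print(f"per-item time: {single_latency_ms} ms single, {per_item_ms_batched} ms batched")
print(f"throughput: {throughput_single:.0f}/s single, {throughput_batched:.0f}/s batched")
```

Under these assumed numbers, the batch takes three times as long as a single call, yet per-item time drops by a factor of about 33 and throughput rises by the same factor.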
When dealing with large amounts of data, a common situation is that the data is manageable on a single machine but unwieldy to load into memory all at once, for example, when the data size on disk is above 10GB and growing.
I need to do some pretty flexible things in my Hive queries, so flexible
that they are beyond the capabilities of HiveQL.
Writing a Hive UDF (user defined function) is an option.
However, all the online examples I could find require the UDF to be a standalone script,
placed at a known location in HDFS, and used via the
ADD FILE statement that is understood by the Hive CLI.
Having to put the script in HDFS and use it on a machine that has the Hive CLI installed interrupts my Python code flow, which I hate.
I want the Hive UDF to be seamlessly integrated into my Python code.
How can I do that?
Here I’m not trying to make a comprehensive list. I just want to suggest a manageable list of really good resources.
In two previous posts, I described using Cython and C, respectively, to speed up a Python program that involves simple numerical computations. In this post, I will use the “big brother” of the industrial languages, C++, to speed up the same program.
In an appetizer served back in April 2017, I demonstrated using
Cython to achieve a speed-up of more than 100 times on numerical bottlenecks in Python code. I have not used Cython in actual projects since then. Instead, I have gained some experience using
C++ to extend Python. Compared with Cython, which is deeply intertwined with Python, using a separate language to write Python extensions has advantages, such as
I read a few sources explaining “word embedding”, especially as carried out by the
I felt there was still some intuitive clarity to be desired at a high level.
Here I’ll attempt to describe it in my own words.
I did some reading and thinking about Gradient Boosting Machine (GBM), especially for binary classification, and cleared up some confusion in my mind.
For a long time, Amazon Athena did not support
INSERT or CTAS (Create Table As Select) statements.
To be sure, the results of a query are automatically saved.
But the saved files are always in CSV format, and in obscure locations.
I’ve implemented a “continuous deployment” (CD) system centered on Docker.
In order to use Spark in my self-sufficient Docker containers without worrying about access to a Spark client environment (to use
spark-submit, for example), I found the Apache Livy project. Livy provides a REST service for interacting with a Spark cluster.
For recurrent pipelines, it is a common requirement to send notifications, or alerts, especially when an error occurs.
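One common pattern for this is a decorator that sends an alert when a job raises. The following is a minimal sketch of that pattern, not the design from the post; `send_alert`, `notify_on_error`, and `nightly_job` are hypothetical names, and the alert is only collected in a list where a real pipeline would send an email or chat message.

```python
import functools
import traceback

# Collected alerts; a stand-in for a real delivery channel.
sent_alerts = []


def send_alert(subject: str, body: str) -> None:
    # Hypothetical notifier: a real one might send email or post to a chat room.
    sent_alerts.append((subject, body))


def notify_on_error(func):
    # Wrap a pipeline step so that any exception triggers an alert,
    # then re-raise so the failure is still visible to the caller.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            send_alert(f"{func.__name__} failed", traceback.format_exc())
            raise
    return wrapper


@notify_on_error
def nightly_job(n: int) -> float:
    # Hypothetical pipeline step that fails when n is zero.
    return 100 / n


ok_result = nightly_job(4)      # succeeds, no alert
try:
    nightly_job(0)              # fails, one alert is recorded
except ZeroDivisionError:
    pass

print(len(sent_alerts), sent_alerts[0][0])
```

Re-raising after the alert keeps the scheduler's own failure handling intact; the notification is an addition, not a replacement.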
The other day I was using the excellent pybind11 to bridge some Python code and C++ code.
The C++ code was performance critical, hence I used
string_view (standardized in C++17) to avoid copying wherever possible.
Recently I needed to define the equality operator between objects in a class hierarchy. A quick search revealed some discussions on this topic, and the opinion appears to be that this, while certainly doable, does not have a clean, elegant solution.
I was considering using environment variables to do some simple configuration management. To this end I experimented with ways to set environment variables and access them in a (Python) program.
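As a small illustration of the Python side, the sketch below sets a variable, reads it back with a default, and shows that a child process inherits it. The variable names `APP_MODE` and `MYAPP_UNSET_EXAMPLE` are made up for this example.

```python
import os
import subprocess
import sys

# Set a variable for this process (and any child processes it starts).
os.environ["APP_MODE"] = "testing"

# Read it back; the second argument is a default used when the variable is unset.
mode = os.environ.get("APP_MODE", "production")

# Make sure the second example variable really is unset, then read it with a default.
os.environ.pop("MYAPP_UNSET_EXAMPLE", None)
missing = os.environ.get("MYAPP_UNSET_EXAMPLE", "fallback")

# A child process started from here inherits the modified environment.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['APP_MODE'])"],
    capture_output=True,
    text=True,
    check=True,
)
child_mode = child.stdout.strip()

print(mode, missing, child_mode)
```

Note that changes made through `os.environ` affect only the current process and its children, not the parent shell that launched the program.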
This post concerns simple text-dump logs, not “data logs” that are sent to, say, Kafka for structured treatment.
With a brand new Mac laptop, I usually do the following set-up in prep for programming work.
Yes, we have all seen the nice decomposition formula
mean_squared_error = squared_bias + variance + irreducible_error
However, I’ve been puzzled by explanations of the tradeoff based on this formula,
because it’s never convincingly clear that the
terms are fixed as we explore different models!
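For reference, the decomposition is usually stated under the assumption that $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, where the expectation is taken over training sets and noise. The symbols $f$, $\hat{f}$, and $\sigma^2$ are introduced here and are not from the original text.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{squared\_bias}}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible\_error}}
```

In this form, the last term does not depend on $\hat{f}$, while the first two depend on both the model family and the training procedure.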
When evaluating Python for enterprise projects, the concern over its ultimate speed arises from time to time.
In this post I will explore passing STL containers to Python in a variety of ways. The focus is to sort out how to pass containers by value and by reference.
In my search for an alternative to raw Python/C API for embedding Python in C++, I had several requirements:
Python has very good inter-operability with C and C++. (In this post I’ll just say C++ for simplicity.) There are two sides to this “inter-op”.
Below are tips on a few very high-level and commonly-encountered topics in Python software development.
With the prospect of starting to use
Spark seriously, people are saying “(it’s time to) learn
I’m very worried about the data team becoming split between languages the way our platform team is.
I’m designing this Git workflow based on Vincent Driessen’s well-known “A successful Git branching model” (known as gitflow), but somewhat simpler, hence the title.
I first encountered S-PLUS (a commercial distribution of S—R’s parent) in 2000, and used it for statistics coursework for a few years. From 2004 through 2011, I used R daily and intensively for implementing my research on statistical methodologies.