Reading and Hacking Python's multiprocessing.managers: Part 3

In Part 1 and Part 2 of this series, we gained a basic understanding of the standard module multiprocessing.managers and fixed a few bugs in it. In this article, we move on to enhancing it. The complete code of these enhancements is available in the module mpservice.multiprocessing.server_process in the package mpservice.

May 29, 2024 Read More

Reading and Hacking Python's multiprocessing.managers: Part 2

In Part 1 of this series, we saw much of the functionality that multiprocessing.managers has to offer. In this article, we will dive into some details, especially around memory management, to gain more understanding. Along the way, we will spot a few bugs or flaws in the standard library’s implementation and propose fixes.

May 25, 2024 Read More

Reading and Hacking Python's multiprocessing.managers: Part 1

There are several ways to communicate between Python processes (as created by the standard package multiprocessing). One common and hugely useful way is by queues. This is an async, passive mechanism, in the sense that a process (the “receiver”) waits on a queue and gets whatever a “sender” (in another process) has placed in the queue. The receiver has no control over what to get and when to get—it gets whatever arrives when it arrives.

Async communication is like mail, whereas sync communication is like phone calls. multiprocessing.managers provides ways to make phone calls between processes.

May 15, 2024 Read More

Guaranteed Finalization Without Context Manager

In the last post, I said “I’m a little nervous about daemon threads,” so I used a context manager to ensure that a background thread is properly closed. However, in the same post, there was indeed a situation where a context manager is not a good tool.

September 11, 2022 Read More

A few Convenience Utilities for Python Multiprocessing

In my use of Python’s multiprocessing module, from time to time I’m bothered by the lack of two convenience features, one concerning exception visibility, and the other concerning logging. I’ve created a subclass of Process to add these features.

September 4, 2022 Read More

An Efficient Algorithm for "Connected Components"

A while back I was debugging some legacy code. Profiling revealed that a few lines of code that compute “connected components” were a major bottleneck.

February 6, 2022 Read More

A Docker Stack for Personal and Team Projects in Python --- Part 3

Part 1 and Part 2 of this mini-series gave an overview of my Docker stack for Python projects and examined the image-building process. This post will dive into using the image or, in other words, running the Docker image(s) during development and production.

May 8, 2021 Read More

A Docker Stack for Personal and Team Projects in Python --- Part 2

Part 1 of this mini-series gave an overview of my Docker stack for Python projects. In this post, I’ll dive into the image-building part of the stack. This part is mostly specific to Python. In fact, image-building has a lot to do with good practices regarding code structure of Python projects and workflows of Python package development and installation. I will use this opportunity to have some discussions on these matters.

May 1, 2021 Read More

A Docker Stack for Personal and Team Projects in Python --- Part 1

I started using Docker in early 2016. After learning it for a month or two, I never did code development outside of Docker again. My Docker workflow has evolved over time. The main driver has been designing for teamwork. So far I have had three major design rounds, which happened in early 2018 (for one team), early 2019 (for another team), and late 2020 (a major simplifying overhaul). By now, I feel the stack has reached a relatively stable and good stage (as I have felt previously!), so I decided to write down its main ideas.

April 25, 2021 Read More

Simple Stream Processing With I/O Operations

Suppose we need to process a data stream in a sequence of steps (or “operations” or “functions”). If all the steps are CPU-bound, then we just chain them up. Each data item goes through the CPU-bound functions one by one; the CPU is fully utilized. Now, if one or more of the steps involve I/O, typical examples being disk I/O or HTTP service calls, things get interesting.
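A minimal sketch of the two situations, using only the standard library. The step functions `square` and `fetch` are hypothetical stand-ins (the latter simulates I/O with `time.sleep`), and a thread pool is just one possible way to overlap I/O waits, not necessarily the approach taken in the post.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Hypothetical CPU-bound step.
    return x * x


def fetch(x):
    # Hypothetical I/O-bound step; the sleep stands in for a disk or HTTP call.
    time.sleep(0.01)
    return x + 1


def run_chained(stream):
    # Each item goes through both steps in turn; every I/O wait blocks the pipeline.
    return [fetch(square(x)) for x in stream]


def run_overlapped(stream, max_workers=8):
    # The CPU-bound step runs in the calling thread as map() pulls items;
    # the I/O-bound step runs in worker threads so the waits overlap.
    # Executor.map yields results in input order.
    squared = (square(x) for x in stream)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, squared))
```

Both functions return the same results in the same order; only the time spent waiting on the simulated I/O differs.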

October 18, 2020 Read More

Service Batching from Scratch, Again

In a previous post, I described an approach to serving machine learning models with built-in “batching”. The design has been used in real work with good results. Over time, however, I have observed some pain points and learned some new tricks. Finally, I got a chance to sit down and make an overhaul to it from scratch.

September 27, 2020 Read More

How to Connect to Localhost from within a Docker Container

I have Docker container A running a server, and container B running a client. At least for testing, both containers run on the same machine (host). The client software needs to reach out of its own container and then into the server container.

May 3, 2020 Read More

Service Batching from Scratch

TensorFlow Serving has “batching” capability because the model inference can be “vectorized”. This means that if you let it predict y for x, versus one hundred ys for one hundred xs, the latter may not be much slower than the former, because the algorithm takes the batch and performs fast matrix computation, or even sends the job to GPU for very fast parallel computation. The effect is that the batch may take somewhat longer than a single element, but per-item time is much shorter, hence throughput is much higher.
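The arithmetic can be made concrete with made-up numbers. The latencies below are assumptions chosen purely for illustration, not measurements of TensorFlow Serving or figures from the post.

```python
def throughput(batch_size, batch_latency_s):
    # Items completed per second when work is processed one batch at a time.
    return batch_size / batch_latency_s


single_latency_s = 0.010  # assumed: predicting one y takes 10 ms
batch_latency_s = 0.020   # assumed: predicting one hundred ys takes 20 ms in total

per_item_batched_s = batch_latency_s / 100
single_tput = throughput(1, single_latency_s)
batched_tput = throughput(100, batch_latency_s)

print(f"per-item time: {single_latency_s:.4f}s single, {per_item_batched_s:.4f}s batched")
print(f"throughput: {single_tput:.0f}/s single, {batched_tput:.0f}/s batched")
```

Under these assumed numbers the batch takes twice as long as a single element, yet the per-item time drops by a factor of 50 and throughput rises by the same factor.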

April 17, 2020 Read More

Biglist for Single-Machine, Out-of-Memory Long Sequence

When dealing with large amounts of data, a common situation is that the data is manageable on a single machine but too unwieldy to load into memory all at once, for example, when the data size on disk is above 10GB and growing.

April 5, 2020 Read More

Integrating Hive UDF's in Python

I need to do some pretty flexible things in my Hive queries, so flexible that it’s beyond the capability of Hive QL. Writing a Hive UDF (user-defined function) is an option. However, all the online examples I could find require the UDF to be a standalone script, placed at a known location in HDFS, and used via the ADD FILE statement that is understood by the Hive CLI. Having to put the script in HDFS and use it on a machine that has the Hive CLI installed interrupts my Python code flow, which I hate. I want the Hive UDF to be seamlessly integrated into my Python code. How can I do that?

October 26, 2019 Read More

Some Resources for Learning Python

Here I’m not trying to make a comprehensive list. I just want to suggest a manageable list of really good resources.

February 24, 2019 Read More

Speeding up Python with C++ and pybind11

In two previous posts, I described using Cython and C, respectively, to speed up a Python program that involves simple numerical computations. In this post, I will use the “big brother” of the industrial languages, C++, to speed up the same program.

December 15, 2018 Read More

Speeding up Python with C and cffi

In an appetizer served back in April 2017, I demonstrated using Cython to achieve a 100+ times speed-up of numerical bottlenecks in Python code. I have not used Cython in actual projects since then. Instead, I have had some experience using C++ to extend Python. Compared with Cython, which is deeply intertwined with Python, using a separate language to write Python extensions has advantages, such as

December 1, 2018 Read More

Understanding Word Embedding by word2vec

I read a few sources explaining “word embedding”, especially as carried out by the word2vec algorithm. I felt that some intuitive clarity was still to be desired at a high level. Here I’ll attempt to describe it in my own words.

November 17, 2018 Read More

Understanding Gradient Boosting Tree for Binary Classification

I did some reading and thinking about Gradient Boosting Machine (GBM), especially for binary classification, and cleared up some confusion in my mind.

November 9, 2018 Read More

"Insert Overwrite Into Table" with Amazon Athena

For a long time, Amazon Athena did not support INSERT or CTAS (Create Table As Select) statements. To be sure, the results of a query are automatically saved. But the saved files are always in CSV format, and in obscure locations.

October 14, 2018 Read More

A Poor Man's Continuous Deployment System Using Docker

I’ve implemented a “continuous deployment” (CD) system centered on Docker.

September 30, 2018 Read More

Talking to Spark from Python via Livy

In order to use Spark in my self-sufficient Docker containers without worrying about access to a Spark client environment (to use spark-submit, for example), I found the Apache Livy project. Livy provides a REST service for interacting with a Spark cluster.

September 8, 2018 Read More

Sending Pipeline Alerts from Python to Slack

For recurrent pipelines, it is a common requirement to send notifications, or alerts, especially when errors occur.

August 12, 2018 Read More

Python, C++, Pybind11, and string_view

The other day I was using the excellent pybind11 to bridge some Python code and C++ code. The C++ code was performance critical, hence I used string_view (standardized in C++17) to avoid copying wherever possible.

January 29, 2018 Read More

Overloading `operator==` for a C++ Class Hierarchy

Recently I needed to define the equality operator between objects in a class hierarchy. A quick search revealed some discussions on this topic, and the opinion appears to be that this, while certainly doable, does not have a clean, elegant solution.

January 27, 2018 Read More

Scoping and Visibility of Environment Variables

I was considering using environment variables to do some simple configuration management. To this end I experimented with ways to set environment variables and access them in a (Python) program.

December 3, 2017 Read More

Simple Rotating Log Capture

This post concerns simple text-dump logs, not “data logs” that are sent to, say, Kafka for structured treatment.

November 19, 2017 Read More

Setting up Mac for Software Development

With a brand new Mac laptop, I usually do the following set-up in prep for programming work.

October 23, 2017 Read More

On the Bias-Variance Tradeoff

Yes, we have all seen the nice decomposition formula

  mean_squared_error = squared_bias + variance + irreducible_error

However, I’ve been puzzled by explanations of the tradeoff based on this formula, because it’s never convincingly clear that the mean_squared_error and irreducible_error terms are fixed as we explore different models!
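As a small numerical check of the decomposition itself, here is a simulation sketch using only the standard library. The setup is entirely hypothetical: the "model" is a shrunken sample mean estimating a constant `f_true`, and the default parameter values are arbitrary choices.

```python
import random
import statistics


def simulate(f_true=2.0, sigma=1.0, n=5, shrink=0.8, trials=20000, seed=0):
    rng = random.Random(seed)
    estimates, squared_errors = [], []
    for _ in range(trials):
        # "Fit" the model: a shrunken mean of n noisy observations of f_true.
        sample = [f_true + rng.gauss(0.0, sigma) for _ in range(n)]
        estimate = shrink * statistics.fmean(sample)
        # Evaluate the prediction against a fresh noisy observation.
        y_new = f_true + rng.gauss(0.0, sigma)
        estimates.append(estimate)
        squared_errors.append((y_new - estimate) ** 2)
    mean_squared_error = statistics.fmean(squared_errors)
    squared_bias = (statistics.fmean(estimates) - f_true) ** 2
    variance = statistics.pvariance(estimates)
    irreducible_error = sigma ** 2
    return mean_squared_error, squared_bias, variance, irreducible_error


mse, bias2, var, noise = simulate()
print(f"mean_squared_error          = {mse:.3f}")
print(f"squared_bias + variance + irreducible_error = {bias2 + var + noise:.3f}")
```

The two printed values should agree up to Monte Carlo error. Varying `shrink` gives a different model, and with it different values of the terms.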

October 10, 2017 Read More

Speeding up Python 147x with Cython

When evaluating Python for enterprise projects, the concern over its ultimate speed arises from time to time.

April 12, 2017 Read More

Embedding Python in C++, Part 3

In this post I will explore passing STL containers to Python in a variety of ways. The focus is to sort out how to pass containers by value and by reference.

February 12, 2017 Read More

Embedding Python in C++, Part 2

In my search for an alternative to raw Python/C API for embedding Python in C++, I had several requirements:

February 11, 2017 Read More

Embedding Python in C++, Part 1

Python has very good inter-operability with C and C++. (In this post I’ll just say C++ for simplicity.) There are two sides to this “inter-op”.

February 10, 2017 Read More

Some High-level Tips for Python Projects

Below are tips on a few very high-level and commonly-encountered topics in Python software development.

September 17, 2016 Read More

Should we use Scala or Python?

Data team,

With the prospect of starting to use Spark seriously, people are saying “(it’s time to) learn Scala”. I’m very worried about the data team becoming split between languages, as our platform team is.

August 9, 2016 Read More

A Simply Successful Git Branching Model

I’m designing this Git workflow based on the well-known “A successful Git branching model” by Vincent Driessen (known as gitflow), but somewhat simpler, hence the title.

July 21, 2016 Read More

Why I Prefer Python to R for Data Work

I first encountered S-PLUS (a commercial distribution of S—R’s parent) in 2000, and used it for statistics coursework for a few years. From 2004 through 2011, I used R daily and intensively for implementing my research on statistical methodologies.

April 20, 2016 Read More