• Skip to primary navigation
  • Skip to main content

Algorithm.co.il

  • About
  • Best Posts
  • Origami
  • Older Projects

lorg

WhatsApp Top Posters

Posted on 2022-06-17 by lorg Leave a Comment

Whatsapp Logo

A few years ago I joined a CTO WhatsApp group. I think I was the 14th person there or so. These days this group has more than 270 members. It is an amazing group, with a lot of valuable content.

People have repeatedly asked for an easily searchable index of the group’s history, and at some point I started writing one. I spent a couple of hours parsing the WhatsApp group history, set up an Elasticsearch server on AWS and uploaded the messages there. I spent a bit more time trying to set up Cognito to work, but I ran out of time to spend on that project, and then the log4j vulnerabilities came and I killed that Elasticsearch server.

One nice side effect of the scripts I wrote was that I could get the list of top-posters in a WhatsApp group according to its history which is pretty useful when you’re group admin. I’ve already used it several times. Since I’ve been asked to run it by other group admins, I thought I’d share it openly. The conversation analysis code is still there but mostly unused.

So here it is: https://github.com/lorg/whatsapp_history . You are welcome to use it anyway you see fit.

Have fun, and let me know if you found it useful :)

Filed under: Programming

A taste of the secret sauce behind JustTalkTo

Posted on 2021-11-22 by lorg Leave a Comment

At JustTalkTo, we realized early on that the idea is not deep tech. There isn’t really a good justification for machine learning, deep learning, a kernel driver, or low-level optimizations. What we do have is a kick-ass product and the tech to back it up.

One of the more interesting pieces of tech that we built early on is our text mechanism, coupled with the feature flags mechanism. I’ll cover both of them in this short article, as I think they can be very helpful to other people as well.

Feature Flags

Feature flags are already a pretty familiar design pattern. The idea is that when you create new functionality, instead of throwing away the old behavior, you selectively enable or disable the new functionality based on a boolean flag that sits somewhere in your database.

This last bit is important – it must be in a place that is easily modifiable by someone outside the dev team, so just creating a constant in your header file will not cut it.

Also, another important thing is to decide on the scope of the feature flag – is it global to all of your users? Or is it user specific? The most common choice should be to make the feature flag user-specific.

Most new features that change the behavior of your product should be enabled or disabled with a feature flag. This does not include bug-fixes, or even all features! Essentially – if you think that some users might want the old behavior – just keep it around.

Once you have this in place – your approach to product development changes. It’s suddenly much easier to say yes to user requests that would normally impact all users – suddenly adding a feature has low risk to other users as you can selectively enable it. You can also enable it only for users who pay for it. You can decide to enable it based on the results of an A/B test. And the people responsible for making that call are not developers, instead it’s usually a combination of sales/marketing/product, trying to optimize for business value for your customers.

Customizable Texts

Some products require translation (AKA internationalization and localization, or i18n and l10n). Usually this is done with static lookup tables. Early on, I decided that we need something similar, but more powerful. In JustTalkTo, every string in our product comes from a dynamic lookup table that is manageable by non-developers. This table has the following fields:

  • Language code, e.g. he_IL
  • type, e.g. prompt, input, web-text
  • name, e.g. “didnt_understand”
  • text, e.g. “I didn’t understand your message”
  • description, e.g. “The response for sending a message that could not be parsed by our system”.

The table is initialized from code using a bit of declarative Python based on a class I wrote:

class Prompts(Texts[Text]):
    didnt_understand = Text("I didn't understand that",
                            "The response for sending a message that could not be parsed by our system")

After being initialized, the text value for a given language code, type and name is taken from the DB. This makes our product incredibly malleable, and with some clever editing it can be changed to fit many use cases. It took us just a few days to realize that the language code is not just about human languages such as English, Hebrew or Spanish, but rather, the languages our users need – so language codes now refer to particular use cases.

In addition, the text displayed can also be a template, so that relevant values (e.g. “user_name”) can be interpolated inside, and in some rare cases, even more complex logic.

Putting it all together

Right now at JustTalkTo, we consider ourselves to still be in the discovery phase – we are still exploring use cases and trying to find our product-market-fit. One approach we had early on was interviewing many users. Our CEO Shira did just that – she interviewed hundreds of them!

Another decision we made was to build what our users asked us to. If a user asked us to build some relevant feature – we’d do our best to build it.

BUT – we prioritized aggressively, and each new feature was built using our customizable texts and controlled by a feature flag. This meant that for the first users, when a new user would join, Shira would interview them and tailor the product to their needs using a combination of language and feature flags. Some users would even get their own language code!

Now, as our offering is starting to mature, we don’t interview all of our users, only some of them. We identify broader use-cases, and configure a set of feature flags and languages to fit these use cases, and automatically sign up new users to a “ready made package”. Our approach allowed us to effectively discover these use-cases and quickly support them for our new users to try them out, and to collect feedback and statistics on their usability, attractiveness and relevance to our success.

If you think this might be relevant to you as well do let me know in the comments, I’ll be happy to discuss practical considerations and answer questions about our implementation. In a coming article I’ll write a bit about automated testing and also cover how it touches this customizable approach.

Filed under: startup

5 useful lessons from writing an app for kids

Posted on 2020-08-07 by lorg Leave a Comment

It always starts with an itch. Today, there aren’t really many good apps to teach kids to read Hebrew. The best one is “Kesem’s Monsters” – and when I tried getting my son to play it, it annoyed me. He liked playing it – but I thought it wasn’t teaching reading all that well.

About a year ago, I was already finished the German skill tree in Duolingo – and I knew what I was looking for. There is an important difference between teaching an adult a second language, and teaching a child to read. While the problems are similar, the adult already knows how to read – it’s a basic skill that is generally transferable between many languages. The child however is normally a native speaker – they already have a good vocabulary – but they need to learn to read, a completely new skill.

So about a year ago, I started working on a reading app for my son. I wrote it in Swift for iOS, which was a fun experience learning a new statically and strongly typed language. The main game for the child was: see a written word – and tap the image it represents. So for the word “Horse” – the child would need to tap the horse icon, as opposed to the house icon.

Here is a demo of the app:

A demo of “Learn to read Hebrew easily”

The app is already available on the app store here. I also built a site for it. Following are 5 lessons I learned while working on this app.

1 – Your free time can make an impact

Working on a startup while having kids is draining. Not just your time – but your energy too. A few years back I decided not to feel guilty about not having the energy to work on a side project. It’s ok at the end of the day just to sit on the sofa and watch some Netflix with your significant other.

However, sometimes you do get that itch to build something. It builds up in you. And when that happens you will find some time. An hour when everyone is taking their afternoon nap on a Saturday. Another hour at night when everyone’s already sleeping.

And the reward in satisfaction is there. You can look at your time and say “I built this!”. It doesn’t have to be the next billion or even just million dollar startup. It can be something as small as a blog post. But it will have an impact.

With this app I knew I was really onto something when I got this feedback from a beta tester (translated from Hebrew):

“Thanks to you, a young first-grader who we just recently discovered also has a development disability (he is a foster child) is advancing his reading and enjoys using the app. Thank you!”

With this, I knew that even if I stopped there, and there won’t be a single additional installation of this app, I made some impact. Not every idea or project you and I will build will have this specific impact. But they will do something.

2 – How to motivate children to learn by playing

Humans are amazing reward optimizers, and children are even more so. Here are some learnings on what worked for me.

  • Provide visual and audio reward and encouragement for every success
  • Make it clear what’s an error and what isn’t
  • Don’t leave loopholes for getting “progress” without learning, children love exploiting those!
    • A bug I described here – my kid would click multiple times on a right answer to earn the points for it multiple times
    • For an image with alternative words you need to pick from – the child will read only the first syllable to find the correct word
  • The meta-game is critical!
    • In my app I added a reward screen where every 5 levels you get another animal added to this reward screen. This encouraged children to continue playing through the levels to get all the rewards.
  • Be careful with negative feedback
    • In an early version, my son would visibly lose points on making a mistake. He got really upset and stopped playing, which surprised me, I didn’t expect such an emotional connection to the points he earned.

3 – A single good experience is not good enough

From the people who installed the app, or tested it I got very good feedback. It’s a great feeling! However, many children played through all the levels once in about an hour, and then stopped playing.

Learning from that – the next major version of the app will have a different engagement model – where I would aim to get the child to play a little more each day. The rewards need to be built differently, and long term engagement needs to be the focus.

4 – Spend just a little bit of money and your work will look professional

Initial beta versions of the app had a white background and I recorded the narration myself. These days, getting some quality graphics and good narration does not cost a lot! You can get very good results for not a lot of money on fiverr.com or upwork.com.

The different response I saw from people before and after I put in these changes was surprising. Before – “cute toy”. After – “wow, you built this?”.

5 – It takes a lot of work to make an app a success

This app is not yet a success. I never expected it to be a great financial success. At most I hoped that some children, including my own, will improve their reading.

To make an app a success takes doing a lot of things, which are hard to do in your free time.

  • Build a good experience
  • Support both iOS and Android
  • Build a website
  • Build a facebook page and maintain it
  • Get screenshots and demo videos for the app store page
  • Write some good copy for the app store page
  • Promote your app with advertisements
  • Track users using analytics and encourage them to use/share/review your app
  • Implement some monetization – be it ads, purchase price of the app, in-app purchases, whatever – this takes work

Happily enough – these days it’s easier than ever to do these things cheaply and easily – but even then it’s still a lot of work.

When working on a side project, you should enter it with open eyes and be aware of what is the most you can expect from the level of investment you put in.

Filed under: Projects

Validating Flight Networks for Drones – part 2

Posted on 2020-07-25 by lorg Leave a Comment

In part-1 I described how we validate flight-networks at Flytrex to make sure that no two nodes of a flight network are too close. Now let’s turn our attention to validating that no two edges are too close.

First, how does one define “edges being too close”? What is the distance between edges?

Here’s a good definition – the distance between edges e1 and e2 is

D(e1, e2) = min D(p1, p2) for p1 ∈ e1 and p2 ∈ e2

That means, the minimal distances of all distances between pairs of points taken from e1 and e2 respectively.

Given that definition – if two edges cross, then of course the distance between them is zero. This is useful – two edges crossing is a more extreme case of two edges being too close to one another.

So how can we implement this? Unfortunately our “closest pair” algorithm for vertices from previous post is not good enough here. I was unsure what to do – and so the easiest solution like all lazy programmers – is going to stackoverflow. Note, just looking for an existing question and answer pair is not good enough – I didn’t find a solution. As I wrote previously – it’s better to not be shy and actually ask for help.

What did I learn? I saw a suggestion to use Rtree. What is that?
Rtree is a wrapper around libspatialindex, which is a library that provides you with, well, a spatial index. A spatial index is a data structure that indexes objects according to their position in space. Read here for some theoretical details.

Visualization of an R*-tree for 3D points using ELKI (the cubes are directory pages).
Taken from the R-tree page on wikipedia

So our approach to solving our problem is relatively straightforward. For each edge in our graph, we will add it to our spatial index. Then, for each edge we will look up the closest edges to it, and make sure that distance between our edge and the closest edges is still more than our MINIMAL_DISTANCE.

How many edges should we retrieve? Well, that depends. Obviously, if two edges share a vertex then the distance between them is zero, however, we do not want to alert on edges that share a vertex. So we need to get the N closest edges such that N > number of neighbour edges of our edge. For some simple graphs that is simple, but for the general case that might include all of the edges (e.g. a sun-shaped graph with a single vertex all edges share.)

Here is some sample code:

def _prepare(self):
    for idx, edge in enumerate(self.edges):
        self.node_to_edges[edge.from_vertex_id].add(idx)
        self.node_to_edges[edge.to_vertex_id].add(idx)
        self.tree.add(idx, self._get_bounds(idx))
def validate(self):
    for idx, edge in enumerate(self.edges):
        neighbor_edges = (
            self.node_to_edges[edge.from_vertex_id] |
            self.node_to_edges[edge.to_vertex_id]) - {idx}
        # The +10 there is because we want the few edges that are not
        # our neighbors. (The distance to our neighbors will always
        # be 0 and we should ignore them)
        nearest = self.tree.nearest(
            self._get_bounds(idx), len(neighbor_edges) + 10)
        for nearest_idx in nearest:
            if nearest_idx in neighbor_edges or nearest_idx <= idx:
                continue
            dist = self._get_distance(idx, nearest_idx)
            if dist < MIN_EDGE_DISTANCE_M:
                p1 = self.id_to_node[edge.from_vertex_id]
                p2 = self.id_to_node[edge.to_vertex_id]
                edge2 = self.edges[nearest_idx]
                q1 = self.id_to_node[edge2.from_vertex_id]
                q2 = self.id_to_node[edge2.to_vertex_id]
                raise ValidationError(
                    f"The edges {p1.name} -> {p2.name}" 
                    f"and {q1.name} -> {q2.name}" 
                    f"are too close to one another. "
                    f"Distance is {dist:.1f}m but should be" 
                    f"{MIN_EDGE_DISTANCE_M}m")

The result of this code worked well. It manages to find issues in flight networks, and does it in reasonable speed.

For myself, I’m happy to have added another tool to my arsenal – Rtree.

Filed under: Algorithms

How to hire a freelancer – 25 useful tips

Posted on 2020-07-19 by lorg Leave a Comment

Over the years I’ve had many opportunities to work with freelancers. Recently a friend had a bad experience looking for freelancers, and I tried to help her find new ones. That prompted me to write a bit about my method. While not completely foolproof, it will increase your chances of finding better freelancers.

Defining the job offer

Before we start, there are a few questions you need to ask yourself.

1) Is this for a job I know how to do myself, but just need someone else to do it, or is it for a job I don’t know how to do myself?

Sometimes, I need someone to do some programming for me. Sometimes it’s some programming that I know how to do but just need manpower, and sometimes it’s in an area that I’m really unfamiliar with, say, machine-learning image analysis. Sometimes it’s for something completely different – e.g. narration, or graphic design – where I really can’t do it myself.

2) For a job that I don’t know how to do myself, do I know how to evaluate the quality of the work?

Let’s say I’m looking for a narrator – I can certainly judge the quality of the work. My judgement might not be the best – I will probably miss some finer points, but I will still be able to see if someone did a good job and it sounds good to me. On the other hand, let’s say I’m looking for someone to port my system to Azure for me, and at best I’m familiar with AWS. They might be making major mistakes in design and I wouldn’t know, because I’m not familiar with Azure.

3) Do I know how to evaluate the quality of the freelancer?

Prior to hiring the freelancer and looking at the result of their work, how well can I predict how good of a job they will do? Of course, looking at past work is the most obvious thing, but sometimes that’s also hard. Many people can boast of impressive demo projects but still will not be able to do what I want them to. Sometimes, their past work is protected by NDAs, e.g. security advisors. In these cases I will usually need to rely on recommendations.

4) Do I have fixed requirements, or is this a project with changing requirements?

Sometimes the requirements are very clear – I need some narration done, a background drawn, some specific functionality implemented. Sometimes, I want a large feature developed, with the details not yet specified, and I need the first mockup version implemented to be able to be specific. Sometimes, I need a result “make my DB go faster” or “make my website secure” – and the particular actions to take are as yet unknown.

5) Is this a long term or a short term, one time engagement?

Sometimes you just need some small task done. Sometimes you are looking for someone to work with for the long term. If I’m looking for a short term and fixed price project, then the requirements must be well known beforehand, and I must be able to judge the quality of the result.
Long term projects can be opportunities for freelancers, so they provide you a way to get a better price or more leeway in changes requested.

6) What is my budget for this project?

That’s one of the most critical questions. You might have a total project budget – or just a monthly budget for work done. Having a tight budget might force you to be very strict about your plans and requirements. Having some experimentation budget might allow you to hire multiple freelancers and pick the best one.

Specific techniques

Once you are clear with yourself about the answers to all of these questions, many decisions will become much easier to make. Here are some techniques to handle various situations.

My requirements are around results and I don’t know what steps need to be done. Examples: “Make my code faster”, “Make my website secure”, “Propose a design to my website pretty”

  1. Make sure you know how to measure or evaluate the results. Specify clear criteria for success.
  2. Do you need just proposals for changes, or actual implementation? Actual implementation is better, unless you can clearly evaluate the proposed changes.
  3. When looking for freelancers, ask them what steps they will take to do their work, and what they expect the proposed changes are going to be. Then compare the results of multiple experts, this will allow you to evaluate who makes sense and who does not.
  4. Building on the previous step – if you had one freelancer suggest that he will do X and the other not suggest it – ask the first “Why did you propose X?” and the second “Why didn’t you propose X?”. For example, when optimizing a database, one freelancer can suggest she will “Set up a read replicate”. Ask the other one why he didn’t suggest it.

I have a lot of budget and I want to get the best freelancers

  1. It depends on what you mean by “a lot of budget” – but one easy way to get good freelancers is to hire multiple freelancers to do the job of just one, and pick the best result.
  2. A cheaper alternative, is to hire multiple freelancers to do a test task, and keep only the best. You won’t even need a lot of budget for that, and if the project is critical for you, there’s a very good chance it’s worth it.
  3. If you’re working with developers, and you are hiring multiple freelancers, consider having them code-review each other. This will increase overall quality and give you another opportunity to evaluate their work.

I am looking for a freelancer for a long-term engagement

  1. The first task must be an evaluation task, and this should be communicated to the freelancer.
  2. Similar to the previous situation – you can use the evaluation task to pick the best freelancer out of a group. You can have all freelancers do the same task, or give each one a different task, as you still get to keep the results even if you don’t continue working with them.

General advice

  1. Always communicate clearly what are your requirements, and what is the evaluation criteria you will be using. This applies to all communications; to the first message, and to your reply when the job is done.
  2. Don’t be afraid to disagree with the freelancer. Don’t be afraid to say that you want changes. Just be clear and upfront about your expectations, and keep to the terms you agreed.
  3. If you’re not an expert in the area – get a friend to give you some advice, especially if a lot of money is at stake.
  4. Don’t be afraid to add requirements and questions when you are selecting a freelancer. Worst case – they will decide not to work with you.
  5. When choosing which freelancer to work with, don’t forget to evaluate the freelancer on their communication ability. If someone doesn’t answer your job interview questions, or doesn’t understand them – how will they complete your requirements? How will they understand the urgent bug you’re trying to explain?
  6. When hiring developers – you MUST either know how to manage a development project, or have a team manager, or at least have a friend advising you.
  7. If you’re hiring for a small non-programming work, consider using fiverr.com.
  8. If you’re hiring for a more complex project or engagement, consider using upwork.
  9. If you’re using upwork, my preference is to filter on >90% job success, and prefer freelancers with significant experience on the platform – which means a lot of money earned or many job-hours done.
  10. For an additional useful listen (in Hebrew), try https://omny.fm/shows/odpodcast/shahar-erez-15-growth which I listened to recently – a lot of useful advice there.

I hope this is useful for you, please share with me in the comments your own techniques for hiring freelancers!

Filed under: Projects

Validating Flight Networks for Drones – part 1

Posted on 2020-07-10 by lorg Leave a Comment

A Flytrex drone lowering a package.

At Flytrex, in order to ensure drones fly only where allowed, and do not collide with each other, they are allowed to fly only in preplanned paths. However, if our goal is to deliver food to backyards, we can’t have a predefined path from our delivery center to each back yard, that would be wasteful. Instead, we created a flight network, which is a graph similar to a railway map, only instead of trains we have drones flying on it. This way, when a drone is required to a fly to a certain delivery point, a path is calculated on the flight network graph.

(Interesting note: while we could theoretically support a 3D flight network with routes above each other, we decided for simplicity’s sake to only allow planar flight networks for now.)

In order to support multiple drones flying at the same time, we also support semaphores and locking, but I’ll cover that subject at another time.

It is critical to verify that a flight network is correct and will not cause problems. We need to make sure that no two edges in the graph cross, otherwise the two drones flying through them might collide. Additionally, no two waypoints (nodes in the graph) may be too close to each other – for the same reason.

How do we do that?

First problem – for every two nodes n1 and n2, we need to make sure that distance(n1, n2) >= MIN_DISTANCE.

Second problem – for every two edges e1 and e2 that don’t share a node, we need to make sure that distance(e1, e2) >= MIN_DISTANCE. Of course, if they intersect then the distance is 0.

The simplest way to solve this is using brute force – go over every possible pair of nodes, and every possible pair of edges. This way however, is too slow – it’s the classic quadratic complexity. Can we do better?

For nodes, it is relatively easy: find the closest pair of nodes, n1 and n2. If distance(n1, n2) < MIN_DISTANCE return error, otherwise, the flight network is ok. How quickly can we implement closest-pair() for nodes? Apparently in O(nlogn), see this Wikipedia article and implementation guide.

We still have a problem though – both implementations assume Euclidean distance – and we need this implemented using e.g. Geodesic distance.

Here, this can be solved using one of two approaches:

  1. Project our nodes n1, n2 over an Euclidean surface such P(n1)=p1 and P(n2)=p2, and Geodesic-distance(n1, n2) ~= Euclidean-distance(p1, p2).
  2. Implement our closest-pair algorithm in a way that will work well enough over the earth’s surface.

(Note: happily enough, we can assume a maximum radius of a few kilometers for our flight network, otherwise this would have been much more complicated)

I tried first to go with option (1). However, how do you implement this projection correctly? I thought – let’s do a naive projection myself, using Wikipedia as a guide. Unfortunately again, this did not pan out. I took a sample of a few points, and calculated the Euclidean and Geodesic distances between them and compared. I got errors of 0%-15%.

15% error is way way too high.

Well, I don’t know, let’s get some help. Unfortunately, I wasn’t able to get this solution working using pyproj, and after some time spent I gave up on that direction. I decided to go back and try to reimplement closest-pair in a way that would work for Geodesic distances.

That worked!

To achieve that, I needed to replace all distance calculations with geodesic distance, and be able to go from a distance to latitude delta. Well, using the same calculation from before that is not too complicated. Also, we can have this support points with altitude without much effort/

Let’s write the function definition for closest-pair in a generic typed way:

def find_closest_pair(
        points: List[P],
        x_getter: Callable[[P], float],
        y_getter: Callable[[P], float],
        distance: Callable[[P, P], float],
        distance_to_x_delta: Callable[[float, P], float]) -> Tuple[P, P, float]:
 
    """Find the two points p1, p2 in points with the shortest distance between them
    Parameters:
        points: a list of points from which we would like to find the closest pair
        x_getter: a function that returns the x coordinate of a point
        y_getter: a function that returns the y coordinate of a point
        distance: a function that returns the distance between two points
        distance_to_x_delta: a function that given a point and a distance, 
            returns the difference in the x coordinate to get a 
            new point that is the given distance away
 
    Returns:
            The pair of closest points and the distance between them
    """

def find_closest_pair( points: List[P], x_getter: Callable[[P], float], y_getter: Callable[[P], float], distance: Callable[[P, P], float], distance_to_x_delta: Callable[[float, P], float]) -> Tuple[P, P, float]: """Find the two points p1, p2 in points with the shortest distance between them Parameters: points: a list of points from which we would like to find the closest pair x_getter: a function that returns the x coordinate of a point y_getter: a function that returns the y coordinate of a point distance: a function that returns the distance between two points distance_to_x_delta: a function that given a point and a distance, returns the difference in the x coordinate to get a new point that is the given distance away Returns: The pair of closest points and the distance between them """

In Part 2 I will write about validating the distance between all edges.

Filed under: Algorithms

Two bugs don’t make a right

Posted on 2012-08-12 by lorg 1 Comment

Three lefts roadsign
While working on my new startup, we are doing a little bit of reasoning using implications. One of the more curious forms of implications is the negative form: consider the following exaggerated example:

  • a place being kid-friendly implies that it is not romantic.
  • a place being a strip club implies it is not kid-friendly

If we allow negative implications to be transitive, then it would follow that since being a strip club makes a place less kid-friendly, it makes it more romantic. We don’t want that. So I had to write some code to specifically ignore that situation. Before writing that, in the best tradition of TDD I wrote a test for two chained negative implications. I implemented the code, the test passed and I was happy.

For a while.

Fast forward a couple of weeks, and I’m trying out adding some negative implications, and the program doesn’t behave as expected. My code doesn’t work. I turn back to my test, check it out, and sure enough, all the thing the test asserts as True are actually True, and the test does test the right thing.

Digging deeper, I discovered the issue. I had two bugs: the first was that the code handling chained negative implications wasn’t working right. The second was in my graph building algorithm – it seems that I was forgetting to add some edges. What made that second bug insidious was that it hid the effect of the first bug from the test – effectively making the test pass.

So – for me it was – two negative implications don’t mean a positive one, and two bugs don’t make a feature.

Filed under: Programming

Optimizing Django ORM / Postgres queries using left join

Posted on 2012-07-13 by lorg 1 Comment

For the latest project I’m working on, we’re using Django with Postgres. I was writing some code that had to find a list of objects that weren’t processed yet. The way they were stored in the DB is like so:

class SomeObject(models.Model):
    #some data
 
class ProcessedObjectData(models.Model):
    some_object = models.ForeignKey(SomeObject, db_index = True)
    #some more data

class SomeObject(models.Model): #some data class ProcessedObjectData(models.Model): some_object = models.ForeignKey(SomeObject, db_index = True) #some more data

In this schema, SomeObject is the original object, and a ProcessedObjectData row is created as the result of the processing. You might argue that the two tables should be merged together to form a single table, but that is not right in our case: first, SomeObject “has standing on its own”. Second, we are interested in having more than one ProcessedObjectData per one SomeObject.

Given this situation, I was interested in finding all the SomeObject’s that don’t have a certain type of ProcessedObjectData. A relatively easy way to express it (in Python + Django ORM) would be:

SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id'))

SomeObject.objects.exclude(id__in = ProcessedObjectData.objects.filter(...some_filter...).values('some_object_id'))

Unfortunately, while this is reasonable enough for a few thousand rows (takes a few seconds), when you go above 10k and certainly for 100k objects, this starts running slowly. This is an example of a rule of mine:

Code is either fast or slow. Code that is “in the middle” is actually slow for a large enough data-set.

This might not be 100% true, but it usually is and in this case – very much so.

So, how to optimize that? First, you need to make sure that you’re optimizing the right thing. After a few calls to the profiler I was certain that it was this specific query that was taking all of the time. The next step was to write some hand-crafted SQL to solve that, using:

SomeObject.objects.raw(...Insert SQL here...)

SomeObject.objects.raw(...Insert SQL here...)

As it turns out, it was suggested to me by Adi to use left-join. After reading about it a little bit and playing around with it, I came up with a solution: do a left join in an inner select, and use the outer select to filter only the rows with NULL – indicating a missing ProcessedObjectData element. Here is a code example of how this could look:

SELECT id FROM (
    SELECT some_object.id AS id, processed_object_data.id AS other_id FROM
    some_object
    LEFT JOIN
    processed_object_data
    ON
    (some_object.id = processed_object_data.some_object_id) AND
    (...some FILTER ON processed_object_data...)
) AS inner_select 
WHERE 
inner_select.other_id IS NULL
LIMIT 100

select id from ( select some_object.id as id, processed_object_data.id as other_id from some_object left join processed_object_data on (some_object.id = processed_object_data.some_object_id) and (...some filter on processed_object_data...) ) as inner_select where inner_select.other_id is NULL limit 100

That worked decently enough (a few seconds for 100k’s of rows), and I was satisfied. Now to handling the actual processing, and not the logistics required to operate it.

Filed under: Databases, Optimization

Collision: the story of the random bug

Posted on 2012-07-05 by lorg 2 Comments

So here I was, trying to write some Django server-side code, when every once in a while, some test would fail.
Now, it is important to know that we are using any_model, a cute little library that allows you to specify only the fields you need when creating objects, and randomizes the rest (to help uncover more bugs).

In this particular instance, the test that was failing was trying to store objects on the server using an API, and then check that the new objects exist in the DB. Every once in a while, an object didn’t exist. It should be noted that the table with the missing rows had a Djano-ORM URLField.

So first things first, I changed the code to print the random seed it was using on every failure. Now the next time it failed (a day later), I had the random seed in hand.

I then proceeded to use that random seed – and now I had a reproducible bug – it failed every time, consistently.

The next step was finding the cause of the bug. To cut a long story short – it turns out that it looked for an object with a specific URL. Which url? the url created for the first object (we had two).

The bug was that the second object was getting the same url as the first. I remind you, these urls are generated randomly. The troublesome url was http://72.14.221.99

I leave you now to guess/check what are the chances for the collision here
(the correct way to do that would be to check any_model’s code for generating urls, and not just say 1 in 2^32… :)

So I made sure the second object got a new url, and all was well, and the land had rest for forty years. (or less).

Filed under: Python

Cheap language detection using NLTK

Posted on 2012-06-30 by lorg 6 Comments

Some months ago, I was facing a problem of having to deal with large amounts of textual data from an external source. One of the problems was that I wanted only the english elements, but was getting tons of non-english ones. To solve that I needed some quick way of getting rid of non-english texts. A few days later, while in the shower, the idea came to me: using NLTK stopwords!

What I did was, for each language in nltk, count the number of stopwords in the given text. The nice thing about this is that it usually generates a pretty strong read about the language of the text. Originally I used it only for English/non-English detection, but after a little bit of work I made it specify which language it detected. Now, I needed a quick hack for my issue, so this code is not very rigorously tested, but I figure that it would still be interesting. Without further ado, here’s the code:

import nltk
 
ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS
 
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}
 
def get_language(text):
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return max(((lang, len(words & stopwords)) for lang, stopwords in STOPWORDS_DICT.items()), key = lambda x: x[1])[0]
 
 
def is_english(text):
    text = text.lower()
    words = set(nltk.wordpunct_tokenize(text))
    return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)

import nltk ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english')) NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()} def get_language(text): words = set(nltk.wordpunct_tokenize(text.lower())) return max(((lang, len(words & stopwords)) for lang, stopwords in STOPWORDS_DICT.items()), key = lambda x: x[1])[0] def is_english(text): text = text.lower() words = set(nltk.wordpunct_tokenize(text)) return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)

The question to you: what other quick NLTK, or NLP hacks did you write?

Filed under: Python

Next Page »
© 2023 Algorithm.co.il - Algorithms, for the heck of it