Posts Tagged ‘Programming’

Open Redirects

Friday, March 26th, 2010

In this post I'll discuss an issue I tackled a short while ago - open redirects. But first, the story of how I got to it. Feel free to skip ahead to the technical discussion.

Background

Our analytics for plnnr.com - the website for trip planning wasn't working as well as we wanted. We're using Google Analytics, and it's hard generating the specific report we want, and when we did get it, it seemed to show inaccurate numbers. To partially alleviate the issue, I was required to add tracking pixels for facebook & adwords, so we can better track conversions.
For us, an "internal" conversion is when a user clicks on a link to a booking url (for a hotel, or any other "bookable" attraction).
After reviewing it, I decided that the best course of action would be to create an intermediate page, on which the tracking pixels would appear. Such a page would receive as a parameter the url to redirect to, and will contain the appropriate tracking pixels.

Description of problem

Let's say we build the url for the page like so:

/redirect/?url=X

This page will load the appropriate tracking pixels, and then redirect to the given url (the X).
The problems are:
1. We are potentially exposing ourselves to Cross Site Scripting (XSS), if we don't filter the redirect url correctly. A malicious website could create links to our page that will run scripts in our context.

2. A malicious webmaster could steal search engine authority. Let's say he has two domains: a.com and b.com, of which he cares about b.com. He creates links on a.com to:

ourdomain.com/redirect/?url=b.com

A search engine crawls his page, sees the links to his domain, and gives ourdomain.com authority to b.com. Not nice.

3. A malicious website could create links to ourdomain.com that redirect to some malware site, this way harming the reputation of ourdomain.com, or creating better phishing links for ourdomain.com.

Possible solutions

Before we handle the open-redirect issues it's important to block cross site scripting attacks. Since the attack might be possible by inject code into the url string, this is doable by correctly filtering the urls, and by using existing solutions for XSS.

As for the open redirect:

1. Non solution: cookies. We can pass the url we want in a cookie. Since cookies may only be set by our domain, other websites would not be able to set the redirect url. This doesn't work well if you want more than one redirect link, or with multiple pages open, etc.

2. Checking the referrer ("referer"), and allowing redirects to come only from our domain. This will break for all users who use a browser that hides referrer information, for example, those using zone-alarm. Google also suggests that if the referrer information is available, block if it's external. That way we are permissive for clients that hide it.

3. Whitelisting redirect urls. This solutions actually comes in two flavors - one is keeping a list of all possible urls, and then checking urls against it. The other is keeping a list of allowed specific url parts, for example, domains. While keeping track of all allowed urls may be impractical, keeping track of allowed domains is quite doable. Downside is that you have to update that piece of the code as well each time you want to allow another domain.

4. Signing urls. Let the server keep a secret, and generate a (sha1) hash for each url of "url + secret". In the redirect page, require the hash, and if it doesn't match the desired hash, don't redirect to that url. This solution is quite elegant, but it means that the client code (the javascript) can't generate redirect URLs. In my case this incurs a design change, a bandwidth cost, and a general complication of the design.

5. Robots.txt. Use the robots.txt file to prevent search engines from indexing the redirect page, thereby mitigating at least risk number 2.

6. Generating a token for the entire session, much like CSRF protection. The session token is added to all links, and is later checked by the redirect page (on the server side). This is especially easy to implement if you already have an existing anti-csrf mechanism in place.

7. A combination of the above.

Discussion and my thoughts

It seems to me, that blocking real users is unacceptable. Therefor, only filtering according to referrer information is unacceptable if you block users with no referrer information.
At first I started to implement the url signing mechanism, but then I saw the cost associated with it, and reassessed the risks. Given that cross-site-scripting is solved, the biggest risk is stealing search-engine-authority. Right now I don't consider the last risk (harming our reputation) as important enough, but this will become more acute in the future.

Handling this in a robots.txt file is very easy, and that was the solution I chose. I will probably add more defense mechanisms in the future. When I do add another defense mechanism, it seems that using permissive referrer filtering, and the existing anti-csrf code will be the easiest to implement. A whitelist of domains might also be acceptable in the future.

If you think I missed a possible risk, or a possible solution, or you have differing opinions regarding my assessments, I'll be happy to hear about it.

My thanks go to Rafel, who discussed this issue with me.

Further reading

* http://www.owasp.org/index.php/Open_redirect
* http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=171297
* Open Redirects and Phishing Vectors/

Ethics in Programming

Monday, March 8th, 2010

Some time ago I was bothered by the issue of ethics in programming.
I heard the question best raised during a "game unconference" I attended. There was a panel about monetary systems for games, and people talked about the issues faced when adding money to an online game.
At one point someone from the audience said about ingame monetary systems (such as in WoW) "it's like gambling and drugs!", to which one panelist jokingly replied "so we have a proven business model", and another said "except it's legal".

This was all in good spirit, but it got me thinking:

What are the programming jobs I will not take?

(more...)

Small Programming Challenge no. 5 – Generating a Permutation

Wednesday, November 11th, 2009

I thought of this one quite a long time ago, and I believe that the idea behind it is pretty nice mathematically. I got the idea for it from Knuth's "The Art of Computer Programming".

The challenge is simple:
write a function that receives as arguments two numbers, n, and num such that 0 <= num < n!. This function needs to return an array (list) representing a permutation of the numbers 0..n-1. For each possible num, the function needs to return a different permutation, such that over all values of num, all possible permutations are generated. The order of permutations is up to you.

The function you write should do this in at most O(n) time & space (Various O(nlogn) are also acceptable).
Write your solutions in the comments, in [ LANG ] [/ LANG ] blocks (without the spaces) where LANG is preferably Python :). I will post my solution in a few days. As usual, the most efficient & elegant solution wins.

Go!

10 Python Optimization Tips and Issues

Monday, September 21st, 2009

Following my previous post on Optimizing Javascript, I thought I'd write a similar post regarding Python optimization.
(more...)

Checking the ulam spiral

Thursday, September 17th, 2009

In the following post, the Ulam spiral is described. It's a very simple object - write down consecutive natural numbers starting from 41 in a square spiral. Curiously, the numbers on the diagonal are primes:

Ulam spiral

Reading this post, I immediately wanted to check how long this property holds. The original blog post suggests:

Remarkably, all the numbers on the red diagonal are prime — even when the spiral is continued into a 20 × 20 square.

Yet it doesn't mention if it stops there or not. Without peeking (in Wikipedia), I wrote a simple program to check:

import sys
import psyco
psyco.full()

UP = 0
LEFT = 1
DOWN = 2
RIGHT = 3

DIRECTIONS = {UP: (0, -1),
              LEFT: (-1, 0),
              DOWN: (0, 1),
              RIGHT: (1, 0)}

def generate_ulam_seq(n):
    start = 41
    x = 0
    y = 0
    min_x = 0
    max_x = 0
    min_y = 0
    max_y = 0
    value = start
    direction = UP
    for i in xrange(n):
        yield x, y, value
        value += 1
        add_x, add_y = DIRECTIONS[direction]
        x, y = x + add_x, y + add_y
        if x <min_x:
            direction = (direction+1) % 4
            min_x = x
        if x> max_x:
            direction = (direction+1) % 4
            max_x = x
        if y <min_y:
            direction = (direction+1) % 4
            min_y = y
        if y> max_y:
            direction = (direction+1) % 4
            max_y = y

def is_prime(n):
    return all(n%x != 0 for x in xrange(2, int((n**0.5) + 1)))

def main():
    for x, y, v in generate_ulam_seq(int(sys.argv[1])):
        if x == y and not is_prime(v):
            print x, y, v

if __name__ == '__main__':
    main()

Running it, we get:

> ulam.py 10000
20 20 1681
-21 -21 1763
22 22 2021
-25 -25 2491
28 28 3233
-33 -33 4331
38 38 5893
-41 -41 6683
41 41 6847
42 42 7181
-44 -44 7697
-45 -45 8051
-46 -46 8413
48 48 9353

So the property doesn't hold for a very long time. Reading up on Wikipedia, it seems the Ulam spiral still has interesting properties related to primes in higher sizes. Maybe I'll plot that one as well in a future post.

PS: Regarding primality testing, this is a cute, quick&dirty one liner. In the range of numbers we're discussing, it is 'good enough'.

My solution to the counting sets challenge

Monday, September 7th, 2009

A few days ago, I wrote up a challenge - to count the number of sets a given set is contained in.

In the comments, I touched briefly on the original problem from which the challenge was created, and I'll describe it in more depth here.
In the problem, I am given an initial group of sets, and then an endless 'stream of sets'. For each of the sets in the stream, I have to measure its uniqueness. relative to the initial group of sets. A set that is contained in only one set from the initial group is very unique, one that is contained in ten - not so much.

So how to solve this problem? My original solution is somewhat akin to the classic "lion-in-the-desert" problem, but more like the "blood test" story. I didn't find a link to the story, so I'll give it as I remember it.

In an army somewhere, it was discovered that at least one of the soldiers was sick and so had to be put in isolation until he heals. It is only possible to check for the disease via a blood test, but tests are expensive, and they didn't want to test all of the soldiers. What did they do?

They took enough blood from each soldier. Now, from each sample they took a little bit, and divided the samples into two groups. They mixed together the samples of each group, and tested the mixed sample. If the sample was positive - they repeated the process for the blood samples of all the soldiers in the matching group.

Now my solution is clear: let's build a tree of set unions. At bottom level will be the union of couples of sets. At the next level, unions of couples of couples of sets. So on, until we end up with just two sets, or even just one - if we are not sure the set is contained in any of the initial sets.

Testing is just like in the story. We'll start at the two biggest unions, and work our way down. There is an optimization though - if a set appears more than say, 10 times, it's not very unique, and its score is zeroed. In that case, we don't have to go down all the way, but stop as soon as we pass the 10 "positive result" mark.

Here's the code:

class SetGroup(object):
    def __init__(self, set_list):
        cur_level = list(set_list)
        self.levels = []
        while len(cur_level)> 1:
            self.levels.append(cur_level)
            cur_level = [union(couple) for couple in blocks(cur_level, 2)]
        self.levels.reverse()

    def count(self, some_set, max_appear = None):
        indexes = [0]
        for level in self.levels:
            indexes = itertools.chain((2*x for x in indexes), (2*x+1 for x in indexes))
            indexes = (x for x in indexes if x <len(level))
            indexes = [x for x in indexes if some_set <= level[x]]
            if max_appear is not None and len(indexes)>= max_appear:
                return max_appear
        return len(indexes)

Here's a link to the full code.

I didn't implement this solution right away. At first, I used the naive approach, of checking against each set. Then, when it proved to be too slow, I tried implementing the solution outlined by Shenberg and Eric in the comments to the challenge. Unfortunately, their solution proved to be very slow as well. I believe it's because some elements appear in almost all of the sets, and so computing the intersection for these elements takes a long time.
Although originally I thought that my solution would suffer from some serious drawbacks (can you see what they are?), the max_appear limit removed most of the issues.

Implementing this solution was a major part of taking down the running time of the complete algorithm for the full problem I was solving from about 2 days, to about 15-20 minutes. That was one fun optimizing session :)

Small Python Challenge No. 4 – Counting Sets

Friday, August 28th, 2009

This is a problem that I encountered a short while ago. It seems like it could be easily solved very efficiently, but it's not as easy as it looks.
Let's say that we are given N (finite) sets of integers - S. For now we won't assume anything about them. We are also given another set, a. The challenge is to write an efficient algorithm that will count how many sets from S contain a (or how many sets from S a is a subset of).

Let's call a single test a comparison. The naive algorithm is of course checking each of the sets, which means exactly N comparisons. The challenge - can you do better? When will your solution outperform the naive solution?

I will give my solution in a few days. Submit your solutions in the comments, preferably in Python. You can write readable code using [ python ] [ /python ] blocks, just without the spaces.