Categories
Cryptography Math

10 Awesome Theorems & Results

When I look back at various mathematical courses I took, most have at least one theorem that I really liked. Usually I like it because the proof has a surprising trick, sometimes it’s because of the unexpected conclusion, or maybe the unintuitive feel to it. In other cases it’s just the elegance of the proof, or the result itself.
Without further ado, here’s a selection of my favorite theorems, in no particular order:

1. Linear Algebra: the Cayley Hamilton theorem. When I first grokked the fact that you can substitute matrices for the variables in polynomials, I was awestruck. Then I learned that you can define eA by using a Taylor series, but the fun doesn’t stop there. Using the Eigenvalues you can greatly simplify the calculation, and it all “works out the same” (i.e., if A=P-1DP and D is diagonal, then p(A) = P-1p(D)P. This works also for Jordan forms). Also, since you can show that complex numbers are isomorphic to the 2×2 matrices of the form [[a, b], [-b, a]], and that the calculations were exactly the same – well, everything “fell into place for me”. At the time it seemed to be one of the joys of Mathematics.

2. Calculus: the Bolzano-Weierstrass Theorem. One of the first non trivial results you learn in calculus, I originally learned the version that says: “Every bounded infinite set has a limit point”, and its proof was a bit more elegant in my eyes than the proof of the Wikipedia version. I liked it so much that one time when I was in boot camp in the service, I worked it out again just to keep my mind working. Good times.

3. Probability: The elegant result of V(x) = E(V(x|y)) + V(E(x|y)). Just the sight of it makes one sigh with contentedness, and the result itself is very nice.

4. Calculus, again: Stokes’ theorem and its friends. Very useful and non intuitive, in layman’s terms it says that you can reason about what happens in an area just by knowing about its perimeter.

5. Numerical Analysis: Richardson Extrapolation: one of the most elegant forms of bootstrapping, you start with a simple approximation method as a building block, and at the end you get a very strong high-quality approximation.

6. Computability: The Parameter theorem. Especially elegant, it basically gives the mathematical definition of the “bind” function for function parameters. In simple terms it uses the source code of a function f(x, y), to find the source code of a function g(y) such that g(y) = f(a, y) for some a. The nice thing about it is that it works only on source code, without calling the function themselves.
This theorem had the added bonus that once I grokked it, the test in computability was very, very easy :)

7. Functional analysis: here it’s a relatively minor result that I ended up remembering distinctly: Given z1.. zn which are linearly independent in E, show that there exists a d such that for each w1…wn that follow ||wi – zi|| < d for each i, are also linearly independent. The footnote says that such a finite, linearly independent group is called stable. When visualizing I think of it this way: given a such a group, kick it. As long as you don’t kick it too strongly – it will stay linearly independent. Now that’s stable.

8. Mathematical Logic: The Compactness theorem: “a set of first-order sentences has a model if and only if every finite subset of it has a model”. One direction is almost trivial, but the other is deep. When studying for the test in this course, I remember being stuck for days on an exercise that required the use of this theorem. Once I fully understood the method of its use, it became a favorite.
(By the way, the exercise was the following: Let G a countable group of first order statements, and p a first order statement. Show that if p is true in every countable model of G, than G |= p.)

9. Cryptography: I’ve learned a bit of cryptography on my own before taking the cryptography course. When I did though, two methods were especially memorable: The first was the “Meet in the Middle” attack. Not to be confused with “Man in the Middle”, this method allows one to attack symmetric ciphers constructed by repeatedly applying a simpler cipher. This known plaintext attack got its name from its method of operation: the attacker calculates all possible decryptions the ciphertext and stores them in a lookup table. Then, he calculates all encryptions of the plaintext and looks them up in that lookup table. Once a result is found – the combination of the encryption and the decryption keys used is the final key of the composed cipher.

10. The second cryptography result that I liked was secret sharing. Trivial secret sharing is so simple, and yet effective, that when I first learned it I thought: “how come I didn’t think of this before?”.

There are obviously many more elegant theorems, some of which I’ve learned in my studies. I sure hope to learn a few more. Still, these are special. As a highschool math teacher once told us about the Pythagorean theorem: “I want you to remember the proof even if I wake you in the middle of the night”. The theorems in this short list come close to that ideal.

Now I wonder – what are your favorite theorems?

Categories
Programming Philosophy

Beautiful Code

A few days ago, @edensh mentioned in Facebook beautiful code, and many people gave examples of assembly, while I was thinking of Python.

That got me thinking: what is beautiful code for me?

So here are my criteria for beautiful code:

  1. Readable (also visually pretty)
  2. Concise
  3. Does something non trivial (usually in an unexpectedly short manner)
  4. Good (solves the problem, efficiently)

If we consider code to be an implementation of a solution to a problem, than 3 & 4 usually apply to the solution, while 1 & 2 apply to the code itself. This brings me to why I like Python:

  1. Code is more readable. Specifically, I can still still easily understand code I wrote years ago. Also, Python’s zen encourages you to write readable code. For example “explicit is better than implicit” directly applies to readability.
  2. Python is visually appealing, although I guess that’s a matter of opinion :)
  3. Python almost always allows me to express my solutions easily & succinctly, whereas with other languages (C, C++, Java) I have to fight to “get my point across”.
  4. Python almost always has the right data structures to implement my solutions efficiently.

With that in mind, it’s clear to me now how assembly code can be beautiful.

Note that I didn’t mention C#, Ruby or Haskell. I don’t have much experience with these languages, but from what I’ve seen so far, it seems to me that these languages may help you write beautiful code. Of these, Haskell is probably going to be the first language I’ll learn – I think it will be the most educating experience, although I’m pretty sure others will argue with me regarding Haskell’s readability :)

Now, My question to you is: what do you think makes code beautiful?

Categories
Challenges Programming Python Statistics

Small programming challenge No. 6 – nblocks

I came up with this challenge when I had to write a function to divide a sequence to percentiles. I needed this to calculate some statistics over trips for plnnr.com. “This sounds trivial” I thought, and reached for my simple blocks function:

def blocks(seq, block_len):
    """blocks(range(5),2) -&gt; [[0, 1], [2, 3], [4]]"""
    seq_len = len(seq)
    if seq_len%block_len == 0:
        num_blocks = seq_len/block_len
    else:
        num_blocks = 1 + (seq_len/block_len)
 
    result = [[] for i in xrange(num_blocks)]
    for idx, obj in enumerate(seq):
        result[idx/block_len].append(obj)
    return result

So I set block_len to len(seq)/10 and called blocks(seq, block_len). Unfortunately, according to the docs of blocks (which I wrote…), when there is extra data, a “partial” block is added – which is exactly what we don’t want when calculating percentiles.
Instead, the behavior we want is nblocks(seq, number_of_blocks), for example: nblocks(range(10), 3) -> [0, 1, 2], [3, 4, 5], [6, 7, 8, 9].

This is a surprisingly hard to write function, or rather, harder than you’d expect. I’ll be especially glad if someone writes it elegantly. My solution works well enough, but isn’t the prettiest.

So, you have the definition – let’s see if you can do better than me. Once enough solutions are presented, I will present my own.

IMPORTANT EDIT: the required signature is nblocks(seq, num_blocks). So for seq(range(10), 3), the return value should be 3 blocks, with the last one having an extra item. As a general rule, the extra items should be spread as evenly as possible.

Categories
Miscellaneous

Moving to a new server

I moved the blog to a new server, and a new wordpress install.

It will take me some time to get everything back in order, but once I finish it, I’m going to get back to writing regularly. Stay tuned!

Categories
Python Sound

One-liner Guitar Tuner in Python

Guitar

On windows, assuming imports are free:

import winsound
winsound.Beep(220*((2**(1/12.0))**7), 2000)

But that’s just because I like to tune to E. If you prefer a more “natural looking” note, you can use A:

winsound.Beep(110, 1000)

But why choose at all when you can go for all of them?

[winsound.Beep(220*((2**(1/12.0))**i), 500) for i in [7, 2, -2, -7, -12, -17]]

Image by Keela84

Categories
Fractals Graphics Math Programming Python

Visualizing Data Using the Hilbert Curve

Some time ago, a coworker asked me to help him visualize some data. He had a very long series (many millions) of data points, and he thought that plotting a pixel for each one would visualize it well, so he asked for my help.
I installed Python & PIL on his machine, and not too long after, he had the image plotted. The script looked something like:

data_points = get_data_points()
n =  int((len(data_points)**0.5) + 0.5)
 
image = Image('1', (n, n))
for idx, pt in enumerate(data_points):
	image.putpixel(pt, (idx/n, idx%n))
image.save('bla.png', 'png')

Easy enough to do. Well, easy enough if you have enough memory to handle very large data sets. Luckily enough, we had just enough memory for this script & data series, and we were happy. The image was generated, and everything worked fine.

Still, we wanted to improve on that. One problem with this visualization is that two horizontally adjacent pixels don’t have anything to do with each other. Remembering xkcd’s “Map of the Internet“, I decided to use the Hilbert Curve. I started with wikipedia’s version of the code for the Python turtle and changed it to generate a string of instructions of where to put pixels. On the way I improved the time complexity by changing it to have only two recursion calls instead of four. (It can probably be taken down to one by the way, I leave that as a challenge to the reader :)

Unfortunately, at this point we didn’t have enough memory to hold all of those instructions, so I changed it into a generator. Now it was too slow. I cached the lower levels of the recursion, and now it worked in reasonable time (about 3-5 minutes), with reasonable memory requirements (no OutOfMemory exceptions). Of course, I’m skipping a bit of exceptions and debugging along the way. Still, it was relatively straightforward.

Writing the generator wasn’t enough – there were still pixels to draw! It took a few more minutes to write a simple “turtle”, that walks the generated hilbert curve.
Now, we were ready:

hilbert = Hilbert(int(math.log(len(data_points), 4) + 0.5))
for pt in data_points:
    x,y = hilbert.next()
    image.putpixel(pt, (x,y))

A few minutes later, the image was generated.

Categories
Python Testing

Fuzz-Testing With Nose

A few days ago, I found a in my website, plnnr.com. The bug was in a new feature I added to the algorithm. The first thing I did was write a small unit-test to reproduce the bug. With that unit-test in hand, I then worked to fix the bug, and got this unit-test to pass.

As I previously persumed this feature to be (relatively :) bug free, I decided that more testing was in order. This time however, a single test-case would not be enough – I needed to make sure that the trip-generation algorithm works in many cases. Enter fuzzing.

Plnnr.com generates trips according to trip preferences. Why not generate the trip preferences with a fuzzer, and then check if the planning algorithm chokes on them? While fuzzing is usually used to generate invalid input with the goal of causing the program to crash, in this case I’m generating valid input with the goal of causing the planning algorithm to fail.

Usually fuzzing is done with one of two techniques – exhaustive fuzzing, that goes systematically (possibly selectively) over the input space and random fuzzing, which picks inputs at random – or “somewhat” randomly. In my case, the input space consists of “world data” – locations of attractions, restaurants, etc, and trip preferences – intensity, required attractions, and so on. Since the input space is so large and “unstructured”, I found it much easier to go with random fuzzing.

In each test-case, I will generate a “random world”, and random trip preferences for that world.
Here is some sample code that shows how this might look:

trip_prefs.num_days = random.randint(0, 5)
trip_prefs.intensity = random(0, 5)
if randbit():
    trip_prefs.schedule_lunch = True

Where randbit is defined like so:

def randbit(prob = 0.5):
    return random.random() <  prob

This is all very well, but tests need to be reproducible. If a fuzzer-generated test case fails and I can’t recreate it to analyze the error and later verify that it is fixed, it isn’t of much use. To solve this issue, the input generation function receives some value, and sets the random seed with this parameter. Now, generating test cases is just a matter of generating a sequence of random values. Here is my code to do that:

class FuzzTestBase(object):
    __test__ = False
    def run_single_fuzz(self, random_seed):
        pass
    def fuzz_test(self):
        random.seed()
        random_seeds = [str(random.random()) for i in range(NUM_FUZZ_TESTS)]
        for seed in random_seeds:
            yield self.run_single_fuzz, seed

FuzzTestBase is a base-class for actual test classes. Each test class just needs to define its own version of run_single_fuzz, and in it call random.seed(random_seed) and log random_seed.

This code uses nose‘s ability to test generators: it assumes that a test generator yields test functions and their parameters.

A few interesting issues:
* I generate the random seeds beforehand, so that calling random.seed() in the actual test case doesn’t affect the seed sequence.
* Originally I used just random.random() as a seed instead of str(random.random()). The problem with that is that this way it’s not reproducible. random.random() returns a floating point value x, for which usually x != eval(str(x)):

In [10]: x = random.random()
In [11]: x == eval(str(x))
Out[11]: False

Even though x == eval(repr(x)) for that case, there’s still room for error. Unlike floating point numbers, it’s harder to go wrong with string equality. So str(random.random()) is just a cheap way to generate random strings.

I’d recommend that if your testing mostly consists of selected test cases based on what you think is possible user behavior, you might want to add some fuzzed inputs. I originally started the fuzz-testing described in this blog-post to better test for a specific bug. After adding the fuzz-testing, I found another bug I didn’t know was there. This just goes to show how useful fuzzing is as a testing tool. The fact that it’s so easy to implement is just a bonus.

Categories
Programming Security web-design

Open Redirects

In this post I’ll discuss an issue I tackled a short while ago – open redirects. But first, the story of how I got to it. Feel free to skip ahead to the technical discussion.

Background

Our analytics for plnnr.com – the website for trip planning wasn’t working as well as we wanted. We’re using Google Analytics, and it’s hard generating the specific report we want, and when we did get it, it seemed to show inaccurate numbers. To partially alleviate the issue, I was required to add tracking pixels for facebook & adwords, so we can better track conversions.
For us, an “internal” conversion is when a user clicks on a link to a booking url (for a hotel, or any other “bookable” attraction).
After reviewing it, I decided that the best course of action would be to create an intermediate page, on which the tracking pixels would appear. Such a page would receive as a parameter the url to redirect to, and will contain the appropriate tracking pixels.

Description of problem

Let’s say we build the url for the page like so:

/redirect/?url=X

This page will load the appropriate tracking pixels, and then redirect to the given url (the X).
The problems are:
1. We are potentially exposing ourselves to Cross Site Scripting (XSS), if we don’t filter the redirect url correctly. A malicious website could create links to our page that will run scripts in our context.

2. A malicious webmaster could steal search engine authority. Let’s say he has two domains: a.com and b.com, of which he cares about b.com. He creates links on a.com to:

ourdomain.com/redirect/?url=b.com

A search engine crawls his page, sees the links to his domain, and gives ourdomain.com authority to b.com. Not nice.

3. A malicious website could create links to ourdomain.com that redirect to some malware site, this way harming the reputation of ourdomain.com, or creating better phishing links for ourdomain.com.

Possible solutions

Before we handle the open-redirect issues it’s important to block cross site scripting attacks. Since the attack might be possible by inject code into the url string, this is doable by correctly filtering the urls, and by using existing solutions for XSS.

As for the open redirect:

1. Non solution: cookies. We can pass the url we want in a cookie. Since cookies may only be set by our domain, other websites would not be able to set the redirect url. This doesn’t work well if you want more than one redirect link, or with multiple pages open, etc.

2. Checking the referrer (“referer”), and allowing redirects to come only from our domain. This will break for all users who use a browser that hides referrer information, for example, those using zone-alarm. Google also suggests that if the referrer information is available, block if it’s external. That way we are permissive for clients that hide it.

3. Whitelisting redirect urls. This solutions actually comes in two flavors – one is keeping a list of all possible urls, and then checking urls against it. The other is keeping a list of allowed specific url parts, for example, domains. While keeping track of all allowed urls may be impractical, keeping track of allowed domains is quite doable. Downside is that you have to update that piece of the code as well each time you want to allow another domain.

4. Signing urls. Let the server keep a secret, and generate a (sha1) hash for each url of “url + secret”. In the redirect page, require the hash, and if it doesn’t match the desired hash, don’t redirect to that url. This solution is quite elegant, but it means that the client code (the javascript) can’t generate redirect URLs. In my case this incurs a design change, a bandwidth cost, and a general complication of the design.

5. Robots.txt. Use the robots.txt file to prevent search engines from indexing the redirect page, thereby mitigating at least risk number 2.

6. Generating a token for the entire session, much like CSRF protection. The session token is added to all links, and is later checked by the redirect page (on the server side). This is especially easy to implement if you already have an existing anti-csrf mechanism in place.

7. A combination of the above.

Discussion and my thoughts

It seems to me, that blocking real users is unacceptable. Therefor, only filtering according to referrer information is unacceptable if you block users with no referrer information.
At first I started to implement the url signing mechanism, but then I saw the cost associated with it, and reassessed the risks. Given that cross-site-scripting is solved, the biggest risk is stealing search-engine-authority. Right now I don’t consider the last risk (harming our reputation) as important enough, but this will become more acute in the future.

Handling this in a robots.txt file is very easy, and that was the solution I chose. I will probably add more defense mechanisms in the future. When I do add another defense mechanism, it seems that using permissive referrer filtering, and the existing anti-csrf code will be the easiest to implement. A whitelist of domains might also be acceptable in the future.

If you think I missed a possible risk, or a possible solution, or you have differing opinions regarding my assessments, I’ll be happy to hear about it.

My thanks go to Rafel, who discussed this issue with me.

Further reading

* http://www.owasp.org/index.php/Open_redirect
* http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=171297
* Open Redirects and Phishing Vectors/

Categories
Optimization Programming Python

Pyweb-il Presentation on Optimization Slides

Last Monday I gave a presentation in pywebil on optimization, that’s loosely based on my blog post on the same subject. Here are the slides for that presentation.

Categories
Javascript Optimization Programming Projects startup web-design

Javascript Element Creator

Some time ago I was working on optimizing the client side code of my website, plnnr.com, an online trip planner.
This website does automatic trip planning, and the problem was that recalculating trips was slow. After profiling, I found out that most of the time wasn’t actually taken up by the algorithm, but by the UI. Rendering the trip to html was the costly part. The process was like so:

Client-side Javascript code generates new trip prefs -> application calculates new trip -> Client-side Javascript gets the new trip, and creates new html.

It’s important to note that the app is “ajax based”, so the actual trip html was generated by the Javascript code, and not the server. At the time I was using Mochikit to generate the new html. Mochikit has a pretty nifty API for generating html, but it’s quite a bit on the slow side. Basically, this API is a wrapper around createElement.

Well, first I did a little test, and found out that generating html with cloneNode and innerHTML is much faster than createElement. Still, there was a problem – I needed to generate many similar elements – similar but not identical. Consider entries on a trip itinerary – they all look the same, yet each one has a different name, a different time string, and a different onclick event.

What I needed was a Javascript based html template library. My requirements:
1. Speed. Html had to be generated quickly.
2. Expressiveness. It had to be able to create pretty arbitrary html with a given context. For example, an anchor element (<a> tag) with a given href property, and a given text content.
3. References to inner elements: Many elements inside the generated html need various events attached to them, or various code processing. This should be easy to achieve.
4. The library has to allow the template html to be written as html, and not only as javascript strings.

So, I sat down with Benny, and we wrote the Javascript Element Creator, which we are now releasing under the BSD license. I originally wrote it to work with Mochikit and the Sizzle library, and Benny changed his version to worked with jquery.

After adding the code to my project, I got two things: first, everything worked much, much faster. Second, it was much easier changing the generated html when it was generated according to a template, and not directly in code.

Instructions

1. Write your html somewhere visible to the javascript code. Add the “template” class to the upper node, and the id will be the name of the template. For example:

<body>
    <div id="some_div" class="template">
    </div>
</body>

2. Similarly to other template engines, add double brackets to signify where text should be inserted:

<body>
    <div id="some_div" class="template">
        <a href="[[link_url]]">[[link_text]]</a>
    </div>
</body>

3. Create a creator object. It will “collect” your template, and will make it available to your code.

var creator = new ElementCreator();

4. Generate your DOM object, and add it to the document;

var obj = creator.generate("some_div",
                           {link_url: '/url/',
                            link_text: 'hello world'});
appendChildNodes(foo, obj);

The code

We decided to publish for now only the jquery version. I might publish the mochikit version as well at a later date. Since Benny wrote the jquery version, he also wrote the tests for that version.

All in all, the final code is pretty short, and could probably be even shorter. Still, it’s good enough, and gave me a very real performance boost.

Here is the code, have fun with it.