Privacy mode not so private

I like my privacy. I also prefer to keep my information secure. I might be a bit more paranoid than the rest, but not extremely so. A short while ago, I discovered something disturbing regarding Firefox. It seems to be a ‘secret everybody knows’, yet Firefox doesn’t say anything about that.

What is it? When you use Firefox’s ‘clear private data’ feature (under options->privacy), even with all the checkboxes checked, a lot of information is still kept. This is true even when using the new ‘private browsing’ mode, which supposedly lets you browse without any record being kept.

How is the information kept? Using Local Shared Objects (LSOs), which are basically cookies used by Flash objects. Who uses these cookies? Almost everyone. The result? If you trusted Firefox so far to keep your browsing history secure, take a look at the following locations, and tell me what you see.

How to mitigate? The simplest option is just to delete the files you see in the aforementioned locations. Better yet, install BetterPrivacy. Of course, you can also install any kind of Flash blocker, or any other tool, to make sure you don’t keep those LSOs.

If you do end up using BetterPrivacy, be sure to check the “On cookie deletion also delete empty cookie folders” checkbox. If you don’t, while the cookies are no longer kept, the record of the sites you visited is still kept locally.
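To see what is left behind before deleting anything, a short script can walk a Flash storage directory and list the LSO files along with the folders they sit in. This is only a sketch: LSOs are stored as `.sol` files, but the root directory is passed in as a parameter rather than assumed, and the function name `find_lsos` is made up for illustration.

```python
import os


def find_lsos(root):
    """Return (folder, filename) pairs for every .sol file under root."""
    found = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".sol"):
                # The folder path, relative to root, can reveal which
                # site stored the object.
                found.append((os.path.relpath(folder, root), name))
    return sorted(found)
```

Printing the result for your own Flash storage directory shows both the files and the folder names that remain on disk.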

Posted in Security

PNG Minification

Following Ned Batchelder’s advice I’ve used pngout to minify the PNGs on my startup’s web page.

It took them down from 641,281 bytes to 338,705, a saving of about 47%. This is quite a nice return for the effort of a download and a single command line:

for %i in (*.png) do pngout "%i"
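The same loop can be sketched in Python, which also makes it easy to measure the before and after sizes. This is a rough sketch, not what I actually ran: it assumes pngout is on the PATH and rewrites each file in place when called with a single file name (as the command line above relies on), and the function names are made up for illustration.

```python
import glob
import os
import subprocess


def total_size(paths):
    """Total size in bytes of the given files."""
    return sum(os.path.getsize(p) for p in paths)


def minify_pngs(directory, tool="pngout"):
    """Run `tool` on each PNG in `directory`; return (bytes_before, bytes_after)."""
    paths = sorted(glob.glob(os.path.join(directory, "*.png")))
    before = total_size(paths)
    for path in paths:
        # The exit status is ignored in this sketch.
        subprocess.call([tool, path])
    return before, total_size(paths)
```

Calling `minify_pngs(".")` in the directory with the images returns the two totals, so the saving can be computed directly.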

Posted in web-design | 4 Comments

Threat analysis, security by obscurity and WordPress

Rusty Lock
  Image by Mykl Roventine

I’ve been running WordPress for a long time now, and luckily so far, it hasn’t been hacked.
Of course – this doesn’t prove anything, as I didn’t count hacking attempts. It also doesn’t show it’s unhackable – on the contrary, I believe that my WordPress installation is hackable by a determined attacker.

However, there’s a subtle issue at play regarding the ‘determined attacker’. There are several kinds of attackers today, and the two most notable ones are the ‘targeted attacker’ and the ‘mass attacker’. The targeted attacker aims to attack your resources specifically, probably because of his interest in them. The mass attacker, on the other hand, is interested in any resource like yours.

From this premise it follows that the two attackers will likely use different methods of operation. The mass attacker is looking to increase his ROI. He will use mass tools with the most coverage, and if an attack doesn’t work on a specific target, never mind, it will work on others. For him, work is worthwhile only if it allows him to attack a substantial number of new targets.
In contrast, the targeted attacker’s goal is to break into your resources. For her, the fact that a given attack will yield hundreds of other targets is irrelevant, unless it helps in attacking you. She might start with off-the-shelf mass tools, but when these don’t work, the targeted attacker will study her target until she finds a vulnerability, and then use it.

Now the question you should ask yourself – who are you defending against? When defending against a mass attacker, making yourself unnoticed and uncommon might be worthwhile. A little security by obscurity will most likely be enough to thwart most of the attacks.
Against targeted attacks you need a more solid defense, utilizing all the tricks in your bag, and still be aware that it probably won’t be enough. You should also seek to minimize damages in case of a successful attack.

Today, most WordPress blogs are under mass attacks. WordPress blogs are searched for, exploited and then 0wned automatically, with the goal of getting the most coverage.
For some time now I’ve been using a small trick that helps to defend against mass attacks. The trick is simple – I added a small .htaccess file password-protecting the admin directory of my wordpress installation. Of course, in all probability the password may be bruteforced or completely bypassed by a very determined attacker, but against a mass attacker it is very effective.
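For reference, the contents of such a file can be sketched as a small Python helper that renders it. The directives are the standard Apache ones for HTTP Basic authentication; the password file path and the realm name are placeholders you would pick yourself, and the password file itself would be created separately (for example with Apache’s htpasswd utility).

```python
def render_htaccess(htpasswd_path, realm="Restricted"):
    """Return .htaccess text that asks for HTTP Basic credentials."""
    return "\n".join([
        "AuthType Basic",
        'AuthName "%s"' % realm,
        # Path to the password file; it should live outside the web root.
        "AuthUserFile %s" % htpasswd_path,
        "Require valid-user",
        "",
    ])
```

The rendered text would be saved as `.htaccess` inside the admin directory.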

I’ve seen suggestions to rename your files and dirs – this will probably also work. Still, it should be noted that methods of this kind only add obfuscation, and so only protect against mass attacks. Personally, I don’t consider renaming worthwhile – it complicates your installation and upgrade process, it requires much more work to get right, and at best it adds security similar to that of the .htaccess file, most likely less.

To conclude – do your threat analysis, and use the defense methods with the most ROI relative to that analysis. Just as another method – do consider using .htaccess files to prevent access to your admin directory.

Posted in Security | 7 Comments

My solution to the counting sets challenge

A few days ago, I wrote up a challenge – to count the number of sets a given set is contained in.

In the comments, I touched briefly on the original problem from which the challenge was created, and I’ll describe it in more depth here.
In the problem, I am given an initial group of sets, and then an endless ‘stream of sets’. For each of the sets in the stream, I have to measure its uniqueness relative to the initial group of sets. A set that is contained in only one set from the initial group is very unique, one that is contained in ten – not so much.

So how to solve this problem? My original solution is somewhat akin to the classic “lion-in-the-desert” problem, but more like the “blood test” story. I didn’t find a link to the story, so I’ll give it as I remember it.

In an army somewhere, it was discovered that at least one of the soldiers was sick and so had to be put in isolation until he heals. It is only possible to check for the disease via a blood test, but tests are expensive, and they didn’t want to test all of the soldiers. What did they do?

They took enough blood from each soldier. Now, from each sample they took a little bit, and divided the samples into two groups. They mixed together the samples of each group, and tested the mixed sample. If the sample was positive – they repeated the process for the blood samples of all the soldiers in the matching group.
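The procedure in the story can be sketched in a few lines of Python. This is only an illustration of the idea as I remember it: `pooled_test` stands in for the lab test on a mixed sample, and all the names here are made up.

```python
def find_sick(soldiers, pooled_test):
    """Return the sick soldiers, testing pooled groups instead of everyone.

    pooled_test(group) stands in for the lab: it returns True if the mixed
    sample of the group is positive.
    """
    if not soldiers or not pooled_test(soldiers):
        return []  # a negative pool clears the whole group at once
    if len(soldiers) == 1:
        return list(soldiers)
    middle = len(soldiers) // 2
    return (find_sick(soldiers[:middle], pooled_test)
            + find_sick(soldiers[middle:], pooled_test))
```

When only a few soldiers are sick, most groups are cleared with a single test, which is where the saving comes from.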

Now my solution is clear: let’s build a tree of set unions. At the bottom level will be the unions of couples of sets. At the next level, unions of couples of couples of sets. And so on, until we end up with just two sets, or even just one – if we are not sure the set is contained in any of the initial sets.

Testing is just like in the story. We’ll start at the two biggest unions, and work our way down. There is an optimization though – if a set appears more than say, 10 times, it’s not very unique, and its score is zeroed. In that case, we don’t have to go down all the way, but stop as soon as we pass the 10 “positive result” mark.

Here’s the code:

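import itertools


def union(sets):
    # Assumed helper (not shown in the post): the union of all the given sets.
    return frozenset().union(*sets)


def blocks(seq, size):
    # Assumed helper (not shown in the post): consecutive chunks of `size` items.
    return [seq[i:i + size] for i in range(0, len(seq), size)]


class SetGroup(object):
    def __init__(self, set_list):
        # After the reverse, levels[0] holds the biggest unions and
        # levels[-1] holds the original sets.
        cur_level = list(set_list)
        self.levels = []
        while len(cur_level) > 1:
            self.levels.append(cur_level)
            cur_level = [union(couple) for couple in blocks(cur_level, 2)]
        self.levels.reverse()

    def count(self, some_set, max_appear=None):
        indexes = [0]
        for level in self.levels:
            # Each surviving node expands to its two children on this level.
            indexes = itertools.chain((2*x for x in indexes), (2*x+1 for x in indexes))
            indexes = (x for x in indexes if x < len(level))
            indexes = [x for x in indexes if some_set <= level[x]]
            if max_appear is not None and len(indexes) >= max_appear:
                return max_appear
        return len(indexes)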

Here’s a link to the full code.

I didn’t implement this solution right away. At first, I used the naive approach, of checking against each set. Then, when it proved to be too slow, I tried implementing the solution outlined by Shenberg and Eric in the comments to the challenge. Unfortunately, their solution proved to be very slow as well. I believe it’s because some elements appear in almost all of the sets, and so computing the intersection for these elements takes a long time.
Although originally I thought that my solution would suffer from some serious drawbacks (can you see what they are?), the max_appear limit removed most of the issues.
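For comparison, here is a minimal sketch of an intersection-based approach of the kind mentioned above: keep, for each element, the indexes of the sets that contain it, and intersect those index sets. This is one reading of the idea, not the commenters’ actual code, and the function names are made up for illustration.

```python
from collections import defaultdict


def build_index(set_list):
    """Map each element to the indexes of the sets that contain it."""
    index = defaultdict(set)
    for i, s in enumerate(set_list):
        for element in s:
            index[element].add(i)
    return index


def count_containing(index, some_set, total):
    """Count the sets containing some_set by intersecting index sets."""
    if not some_set:
        return total  # the empty set is contained in every set
    # Start from the rarest element to keep the intersection small.
    postings = sorted((index.get(e, set()) for e in some_set), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
        if not result:
            break
    return len(result)
```

When some elements appear in almost all of the sets, their index sets are nearly as large as the whole group, so intersecting them gets expensive.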

Implementing this solution was a major part of taking down the running time of the complete algorithm for the full problem I was solving from about 2 days, to about 15-20 minutes. That was one fun optimizing session :)

Posted in Challenges, computer science, Python | 5 Comments

Fractals in 10 minutes No. 6: Turtle Snowflake

I didn’t write this one, but I found its simplicity and availability so compelling that I couldn’t just not write about it.
In a reddit post from a while ago, some commenter named jfedor left the following comment:

A little known fact is that you can do the following on any standard Python installation:

from turtle import *
 
def f(length, depth):
    if depth == 0:
        forward(length)
    else:
        f(length/3, depth-1)
        right(60)
        f(length/3, depth-1)
        left(120)
        f(length/3, depth-1)
        right(60)
        f(length/3, depth-1)
 
f(500, 4)

If you copy-paste, it’s a fractal in less than a minute. If you type it yourself, it’s still less than 10. And it’s something you can show a kid. I really liked this one.

Posted in computer science, Fractals, Python | 4 Comments

Small Python Challenge No. 4 – Counting Sets

This is a problem that I encountered a short while ago. It seems like it could be easily solved very efficiently, but it’s not as easy as it looks.
Let’s say that we are given N (finite) sets of integers – S. For now we won’t assume anything about them. We are also given another set, a. The challenge is to write an efficient algorithm that will count how many sets from S contain a (or how many sets from S a is a subset of).

Let’s call a single test a comparison. The naive algorithm is of course checking each of the sets, which means exactly N comparisons. The challenge – can you do better? When will your solution outperform the naive solution?
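As a baseline, the naive algorithm fits in a couple of lines (the function name is just for illustration):

```python
def naive_count(set_list, a):
    """Count the sets in set_list that contain a, one comparison per set."""
    return sum(1 for s in set_list if a <= s)
```

For example, `naive_count([{1, 2}, {2}, {1, 2, 3}], {1, 2})` makes three comparisons and finds two containing sets.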

I will give my solution in a few days. Submit your solutions in the comments, preferably in Python. You can write readable code using [ python ] [ /python ] blocks, just without the spaces.

Posted in Challenges, computer science | 17 Comments

PyWeb-IL Presentation on Harvesting: Finding the Most Influential Artists

Yesterday I gave a presentation on harvesting to the PyWeb-IL group. In the presentation, I described what I learned about harvesting and also gave a concrete example of how to find the “most influential artists” using data from allmusic.com and a (very) naive implementation of PageRank.

The PageRank implementation follows the Wikipedia description word for word, and is not efficient, but it works well enough for this presentation. I included it and the allmusic.com example mostly because I thought the results are pretty cool, and it’s very good teaching material.
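To give a feel for what a naive PageRank looks like, here is a minimal sketch. It is not the code from the presentation: the graph representation (a dict mapping each node to the nodes it links to), the damping factor of 0.85 and the fixed number of iterations are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Naive PageRank by repeated iteration.

    links maps each node to the list of nodes it points to.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = dict((node, 1.0 / n) for node in nodes)
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {}
        for node in nodes:
            new_rank[node] = (1.0 - damping) / n + damping * dangling / n
        for node, targets in links.items():
            for target in targets:
                # Each node splits its rank equally among its links.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank
```

With artists as nodes and a link from each artist to the artists who influenced them, the highest-ranked nodes would be the “most influential” ones.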

Here is the presentation, and the code is available here.

Here is how to run it:

D:\work\pywebil-harvesting\upload>allmusic.py "/cg/amg.dll?p=amg&sql=11:3pfrxqq5ld6e" 2 out.pkl

simple_pagerank.py out.pkl

Happy harvesting!

Posted in Python | 1 Comment

Bulk INSERTs FTW

A short while ago, I had to research some API for a company I’m consulting for. This API yields very good quality data, but that data isn’t convenient to process for further research.
The obvious solution was to dump this data into some kind of database, and process it there.
Our first attempt was pickle files. It worked nicely enough, but when the input data was 850 megs, it died horribly with a memory error.

(It should be mentioned that just starting to work with the API costs about 1.2 gigs of RAM.)

Afterwards, we tried sqlite, with similar results. After clearing it of memory errors, the code (sqlite + sqlalchemy + our code) was still not stable, and apart from that, dumping the data took too much time.

We decided that we needed some *real* database engine, and we arranged to get some nice sql-server with plenty of RAM and CPUs. We used the same sqlalchemy code, and for smaller sized inputs (a few megs) it worked very well. However, for our real input, the processing would have taken more than two weeks to finish, had it not died in a fiery MemoryError (again!).

(In my defense regarding the MemoryError, I’ll add that we added an id cache for records, to try and shorten the timings. We could have avoided this cache and the MemoryError, but the timings would have been worse. Not to mention that most of the memory was taken by the API…)

At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested bulk inserts.

The recipe is simple: dump all your information into a csv file (tabs and newlines as delimiters).
Then do BULK INSERT, and a short while later, you’ll have your information inside.
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.
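The recipe can be sketched roughly as follows. This is not our actual code: the table name and file path are placeholders, the field values are assumed to contain no tabs or newlines, and the row terminator is written in hex (`0x0a`) so that it matches the plain line feeds written to the file.

```python
def write_bulk_file(path, rows):
    """Write rows as tab-delimited lines, one record per line."""
    with open(path, "w", newline="\n") as f:
        for row in rows:
            fields = [str(value) for value in row]
            for field in fields:
                # A delimiter inside a value would corrupt the record.
                if "\t" in field or "\n" in field:
                    raise ValueError("field contains a delimiter: %r" % field)
            f.write("\t".join(fields) + "\n")


def bulk_insert_sql(table, path):
    """Build a BULK INSERT statement for a file written by write_bulk_file.

    The names are interpolated as-is, so they must come from trusted code.
    """
    return ("BULK INSERT %s FROM '%s' "
            "WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '0x0a')"
            % (table, path))
```

The file has to be readable from the database server itself, since that is where the statement runs.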

My suggestion: add FTW as a possible extension for the bulk insert syntax. It won’t do anything, but it will certainly fit.

Posted in Databases, Programming | 1 Comment

A New Kind of Journalism and Citizen Involvement

It’s become a fashion of late to write about the effect the Internet had on journalism, and the way people get informed. Usually the discussion revolves around blogs, twitter, how the newspapers are dying, and so on.

I’d like to point out something different that I’ve observed of late.

It started a few weeks ago, with the story of judge Drori. He acquitted a man who ran over a parking lot clerk because she refused to let him leave without paying, and stood in the way of his car. Once people had read the story and the actual court ruling, there was a public outrage. Judge Drori wrote a ruling of about 300 pages, in which he explains the acquittal. Many people commented about the ruling itself, its length, the reasons given in it and so on.

The second story is about the farmer Shay Dromi, who was acquitted today of killing. Two years ago, two Bedouin burglars broke into his farm at night, poisoned his dog, and then went about their business of stealing his property. At least they would have, had Dromi not noticed them, confronted them, shot one to death, and wounded the other. This was amid a wave of crime and break-ins in the area, while the police weren’t doing much to stop that wave. As I said, Dromi was acquitted, the ruling was also published, and many people commented on the subject.

Now we can get to the point: usually, acquittals or convictions of the “small” people don’t merit much press. Judge Drori’s ruling probably would not have reached that publicity if he wasn’t up for a seat at the supreme court. Dromi’s story was publicized heavily a few years back, and the Knesset even changed the self-defense law because of this case.
However, the publishing of full-text rulings is new. Except for a case I was personally involved in, I never read court rulings. I don’t really know a lot about law.

Having these two stories published online, and not only in print, allows publishers to link to the full rulings. Newspapers will never come with 300 attached pages of dense legal text.
Yet online it’s as easy as creating a link – and just like that, you have citizens reading rulings, understanding court processes, having opinions, commenting, and getting involved.

I find this amazing, and it makes me optimistic. The times are a-changing.

Posted in Uncategorized | 3 Comments

Origami Lizard On Book

Origami Lizard on Book

I took this quite some time ago. The book was lying on my desk; I don’t remember the specific reason I needed it at the time.
I thought I’d also prepare the instructions for this critter – they are quite easy and yield this nice lizard/dinosaur thingy. When I first created this fold, I thought it was quite an achievement to make it with four legs.

In any case, it came out as a very nice image. I find it amusing that in the book I’m currently reading, there are mentions of computational origami as a prerequisite of transcendence.

Posted in Origami | 1 Comment