AI3 Regular Blog



Jan 12, 2013

I've been blogging more than usual since I released AI3 on Christmas Eve. You should check it out. Of all the websites I have released, AI3 has the most potential and should get the most respect. I purchased a super-fast server (with an SSD, especially for fast database lookups), leased a super-fast colo space for it, and am going to add to it regularly. As a feature of AI3, I will attempt to keep a regular blog here with insight into each feature of the website, and then I will make a page with that data on AI3 using a simple slug. I've already done a few if you want to look at the past few blog posts.

The feature that I'm going to discuss today is single-minded research of a single difficult topic. Searching for a common word in Google can be one of the most frustrating things in the world. What you really want is for someone to answer the question you are asking, not to learn every way your question can be misunderstood. Sometimes AI3 will fail. There's no doubt that Google is more in depth than anything I can create, even if I had all of Wikipedia.

So let's get in depth on a very simple question. It's not one of the easy questions I've been dealing with. Let's ask: "Is the word 'We' used more positively or negatively?" By that, I mean "Is the sentence 'We plan to solve poverty by 2017' more common than 'We cannot solve poverty by 2017'?" And not just that one sentence, but every sentence of the positive form "We *verb*" versus the negative form "We *verb* not". This is a deviously difficult problem. Even with a huge corpus, a definitive answer requires statistical analysis of a ton of sentences. Let's attempt it though.

Start with We and we. All words in AI3 are case-sensitive, which is why there are links to all variants of we on the We word page. 1276 pages is too many to read through unless we have a script. Let's try collocation of We. It's a slow process because We is such a common word. You can look below if you're impatient. While you're waiting, maybe try looking at a few sentences. The second sentence is:

`` We didn't want town work '', Jones said.
Eureka already? Yup. All we need to do is find We together with every word that is in the negative. That's pretty easy, right? There are only four pages of words that contain n't and most of them are pretty uncommon. Note that there's a bug where two words joined by dashes are treated as one word. That's a problem with my parser, which should be more intelligent about whitespace. So manually or automatically, we can start searching for sentences that contain We didn't and so on. Since the related page doesn't have a count (due to slowness), we are stuck just trying a high page number and using a binary search from there.

If you don't know what a binary search is, let me explain. Let's say that there could be upwards of 100 pages of sentences. Simply skip to page 100. If it gives you an error, then there aren't that many pages. Go to half that number, page 50. Halve the number again and again until you come up with a valid page. Then pick a number halfway between the valid page and the invalid page, and keep narrowing the gap that way. After a few hits, you will find that page 6 is the end of We didn't. In total, it should only take about 7 tries to find any number between 1 and 100, because each try cuts the range in half and 2^7 is 128, which is more than 100. If you don't understand the math, hopefully you'll understand the process.

Anyway, now we have a way of counting all the negative sentences. Then we simply need to count all the sentences that contain We. That can be found on the We word page. But let's say that you thought this algorithm through and have some skill with a database. How long would it take you to come up with the solution?
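
For the script-minded, here is a rough Python sketch of the page hunt described above. It is only an illustration, not how AI3 itself works: `page_exists` is a hypothetical callback that would load a given page number and report whether it came back without an error, and the sketch assumes pages are numbered from 1 with no gaps.

```python
from typing import Callable


def find_last_page(page_exists: Callable[[int], bool], guess: int = 100) -> int:
    """Return the highest valid page number, or 0 if even page 1 is missing.

    page_exists is a hypothetical stand-in for loading a page on the site
    and checking whether it returned an error.
    """
    if not page_exists(1):
        return 0

    lo = 1       # highest page known to be valid
    hi = guess   # candidate for the lowest page known to be invalid

    # If the first guess turns out to be valid, keep doubling until we overshoot.
    while page_exists(hi):
        lo = hi
        hi *= 2

    # Now lo is valid and hi is invalid; halve the gap until they are neighbors.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if page_exists(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Simulated site with 6 pages of sentences, like the We didn't example.
last = find_last_page(lambda page: 1 <= page <= 6)
print(last)  # 6
```

Multiply the last page number by the number of sentences per page (minus whatever is missing from the final page) and you have a count without ever needing the slow count query.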


AltSci Cell - About

Cell is place.
Welcome to Cell.
