Google is a great search engine, but there is growing user frustration. Why? Because if something makes money, Google wants to make sure you click on an ad to get to it, and if it doesn't make money, Google wants you to see the answer immediately on Google and not go elsewhere. I don't believe this serves the interests of users long term, and I see it as an opportunity to create an alternative search engine.
My proposal is to build a distributed search index on top of which both large scale and niche search engines can run. Distributed search engines are not a new idea, but I believe my approach in terms of how you bring it to market will help solve the chicken and egg problem of getting enough usage for it to work.
What does it take to build a new search engine? I would argue the 4 main components are:
- Crawling the internet to get the data you want to search
- Indexing the data you have crawled
- Searching and Ranking across the index
- User adoption of the new search engine
One of the largest challenges in building an effective search engine is having a sufficiently large and up-to-date index of everything on the web, so let's start there. Crawling and indexing the web is hard. 15% of Google searches have never been searched before. The important takeaway for me here isn't that people are asking new questions, it's that new things are happening, and they make up a significant portion of people's interests. This means you need a dataset that updates rapidly.
There is a great core dataset you can get from Common Crawl, but it's updated monthly, so results based only on it would be stale, and Common Crawl is limited in scope. From there you can layer on your own data, but because so much traffic is automated, many websites will ask you not to scrape via their robots.txt, or outright block you via a firewall, if you're not one of the major search engines. This gives large search engines a huge advantage, since their large crawl dataset is the core on which the rest is built. Without a good dataset, you can't have a good search engine.
Who can crawl every single website in the world, especially as the newest ones are being published or updated? The users visiting these webpages. What if every single browser in the world retained a locally cached copy of the latest version of every website it visited? I built a proof of concept of this using a Chrome extension plus Mozilla's Readability to strip the text from a given webpage, with some basic full text search over the captured data. It works as expected and gets us to a good starting point. The Better History Chrome extension is a more mature implementation of this. This doesn't have to be done exclusively by end users, but if they are the primary crawlers in your day one architecture, I believe this search engine will have a significant competitive advantage.
Additionally, we can get the majority of the value that a website offers its users by extracting only the visible text at the time a user departs a given site. Yes, this can be improved with on-device ML to classify pictures, and possibly smarter handling of common objects like tables of information, but I don't think it's necessary from the get-go.
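To make the single-node piece concrete, here is a minimal sketch of a local cache with basic full text search. It is not the proof of concept described above (that one is a Chrome extension); the names `CachedPage`, `LocalIndex`, and `tokenize` are hypothetical, and it assumes the visible text has already been extracted by something like Readability.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CachedPage:
    url: str
    text: str           # visible text captured when the user left the page
    captured_at: float  # Unix timestamp of the capture


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LocalIndex:
    """Hypothetical per-browser index: one entry per URL, latest capture wins."""

    def __init__(self) -> None:
        self.pages: dict[str, CachedPage] = {}
        self.postings: dict[str, set[str]] = defaultdict(set)  # term -> URLs

    def add(self, page: CachedPage) -> None:
        old = self.pages.get(page.url)
        if old is not None:
            if old.captured_at >= page.captured_at:
                return  # we already hold a copy that is at least as recent
            # Drop the stale copy's terms before indexing the new text.
            for term in set(tokenize(old.text)):
                self.postings[term].discard(page.url)
        self.pages[page.url] = page
        for term in set(tokenize(page.text)):
            self.postings[term].add(page.url)

    def search(self, query: str) -> list[CachedPage]:
        """Return pages containing every query term, most recent first."""
        terms = tokenize(query)
        if not terms:
            return []
        urls = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(
            (self.pages[u] for u in urls),
            key=lambda p: p.captured_at,
            reverse=True,
        )
```

Revisiting a page replaces the older capture, so a query only ever matches the latest text the user actually saw. A real implementation would need persistence and better ranking than plain term intersection.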
OK, so we have a single machine with the dataset of websites the user has visited and some full text search capability on it. How does that help us actually create a global searchable index? I would propose using Gnutella or a similar protocol. There are performance implications related to routing, which are addressed later in this post.
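As a rough illustration of what a Gnutella-style search looks like, here is a simplified in-memory simulation of query flooding with a time-to-live (TTL). It is a sketch of the general technique, not the Gnutella wire protocol; `flood_query`, `graph`, and `local_hits` are hypothetical names, and the network is just a dictionary of neighbors.

```python
from collections import deque
from typing import Callable


def flood_query(
    graph: dict[str, list[str]],
    local_hits: Callable[[str, str], list[str]],
    origin: str,
    query: str,
    ttl: int = 3,
) -> list[tuple[str, str]]:
    """Flood a query outward from `origin`, at most `ttl` hops away.

    Each node answers from its own local data and forwards the query to
    neighbors that have not seen it yet, until the TTL runs out.
    Returns (answering_node, url) pairs.
    """
    seen = {origin}
    queue = deque([(origin, ttl)])
    hits: list[tuple[str, str]] = []
    while queue:
        node, remaining = queue.popleft()
        hits.extend((node, url) for url in local_hits(node, query))
        if remaining == 0:
            continue  # TTL exhausted: answer but do not forward
        for peer in graph.get(node, []):
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, remaining - 1))
    return hits
```

The TTL caps how far a query travels, which bounds traffic but also means distant nodes never get asked. That trade-off is the routing cost mentioned above.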
Search and Rank
Once we have a global index, we need to be able to search it. I don't have a solution here. Searching per node and then aggregating the results across nodes seems to be a problem the large search engines have solved, but I am not an expert on it. I expect that the network latency introduced by the nodes' distribution across the public internet will also be an important issue to overcome. However, I have four core ideas which I believe are important to both search performance and quality:
- More recent versions of a result are more valuable. If user A has a copy of my site from a week ago and user B has a copy from an hour ago, the search results from user B have more value than those from user A. When a distributed search receives multiple conflicting results, the most recent one should be prioritized.
- Social graph proximity is a good proxy for relevancy and authority. If a result comes from a friend, it's much less likely to be spam or irrelevant than if I get it from someone 1 Kevin Bacon away from me.
- Central servers can collect information uploaded by individuals, and then act as a proxy on their behalf thereby reducing network distance and number of hops in the search graph.
- Nodes can cache data as it is recursed through them. This reduces the network distance between the node asking the question and the one answering it. However, this poses a number of challenges around expanded security risk, data storage needs, and data transfer volume, as well as questions about how cached copies should count toward network distance, given that distance affects result quality as mentioned above.
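The first two ideas can be sketched as a small merge step on the node that issued the query. This is only one possible reading: `NodeResult` and `merge_results` are hypothetical names, and the choices to keep the freshest copy per URL, credit it with the smallest hop count seen, and sort by hops before recency are my assumptions rather than a settled ranking design.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeResult:
    url: str
    snippet: str
    captured_at: float  # Unix timestamp of the answering node's copy
    hops: int           # social graph distance to the answering node (1 = friend)


def merge_results(results: list[NodeResult]) -> list[NodeResult]:
    """Collapse conflicting copies of a URL, then rank by proximity and freshness."""
    merged: dict[str, NodeResult] = {}
    for r in results:
        best = merged.get(r.url)
        if best is None:
            merged[r.url] = r
            continue
        # Conflicting copies: keep the most recent content, but remember the
        # closest node that vouched for this URL.
        newer = r if r.captured_at > best.captured_at else best
        merged[r.url] = replace(newer, hops=min(r.hops, best.hops))
    # Closer in the social graph first; more recent first within the same distance.
    return sorted(merged.values(), key=lambda r: (r.hops, -r.captured_at))
```

For example, if a friend holds a week-old copy of a page and a distant node holds an hour-old copy, the merged result carries the hour-old content with the friend's hop count.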
Go to Market
Let's assume you can get this working as a fully functional tech demo. Searching your index will be slower and less accurate than searching Google, but you're still in a functional place. So how do you actually get this working as a business?
Start by not competing with public search. Start by offering this as an enterprise search product. If a company uses Coveo or one of its competitors, it's going to run into challenges like the enterprise search provider not having connectors for tools X and Y that it uses, especially internal ones. Since your search product is based on web caching, this isn't a concern. You support every single web based tool on day 1. An important caveat here is that you will have to build out support for a few non-web based connectors for tools like Slack, MS Teams, and Git providers, but that's the cost of doing business. At the end of the day, you still have fewer connectors to maintain, and that's a significant advantage. The other added benefit of starting with a B2B tool is that companies have a very accessible social graph, which can be seen via their LDAP/AD/whatever other tool they use.
But there is a trade-off: your index doesn't take permissions into account. For example, a user can search repeatedly with various string combinations to enumerate the contents of the cache, and thereby figure out private data that only HR should be seeing.
This will be addressed using two sets of include and exclude lists of URI regexes. The first set will manage what is and is not cached on a given node. The second set will be a map from URI regexes to the groups or individual users who should always or never be able to access matching results, and it will act as a search filter as each node processes the search results it is returning.
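Here is a minimal sketch of those two sets of lists. The names `CachePolicy`, `AccessRule`, and `can_see` are hypothetical, as is the `hr.corp.example` host in the usage note. The precedence rules (exclude beats include, deny beats allow, and a non-empty allow list restricts a pattern to its members) are assumptions I made to get something concrete.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CachePolicy:
    """First set: decides what a node is allowed to cache at all."""
    include: list[str] = field(default_factory=list)  # empty means "everything"
    exclude: list[str] = field(default_factory=list)

    def should_cache(self, url: str) -> bool:
        if any(re.search(p, url) for p in self.exclude):
            return False  # assumption: exclude always wins
        return not self.include or any(re.search(p, url) for p in self.include)


@dataclass
class AccessRule:
    """Second set: maps a URI regex to who may or may not see matching results."""
    pattern: str
    allow: set[str] = field(default_factory=set)  # users or groups
    deny: set[str] = field(default_factory=set)


def can_see(url: str, principals: set[str], rules: list[AccessRule]) -> bool:
    """Filter applied by a node to each result before returning it.

    `principals` is the searching user's ID plus the groups they belong to.
    """
    for rule in rules:
        if re.search(rule.pattern, url):
            if principals & rule.deny:
                return False  # assumption: deny always wins
            if rule.allow and not (principals & rule.allow):
                return False  # pattern is restricted and the user is not listed
    return True
```

With a rule such as `AccessRule(r"^https://hr\.corp\.example/", allow={"group:hr"})`, a node would drop HR pages from results sent to anyone outside that group, which blunts the enumeration attack described above.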
Additionally, I believe that organizations will want to have certain common URL paths that they always include in or exclude from the index before any searching happens. For example, we never want anything at https://facebook.com/* to be indexed. These types of lists should be community contributed, similar to how uBlock Origin or Spamhaus operate, where organizations can subscribe to various lists to help enforce the behavior of their network. This isn't bulletproof, but you'll get 80% of the way, and it will improve over time.
Consumer Search Platform
OK, so let's assume this all works for companies. What does any of this have to do with creating an alternative to Google? Take all of those tools you have given enterprises to lock down what their users can do, and turn them into a platform. Instead of an org pushing policies down on users, you allow communities to build their own search engines based on include and exclude lists, which users can subscribe to. Since it costs you near $0 to host these lists, you can give them to users as cheap as free and still make money overall.
- Paid access revenue split – A group of dentists could set up their own dentist focused filtered search and charge $10 a month for access, and you would take 10% fees from that.
- Ad platform – One of the Kardashians could set up a search engine that only returns Kardashian-approved merch and run ads on it to support the thing. You would provide the advertising platform and take 30% of the revenue, and the Kardashians would get their 70%.
- Free – A group of kids want to set up a Minecraft specific search engine to exclude all the SEO spam and surface actually interesting resources across the web. They do it, they don't earn anything, you don't charge anything.
I am under no illusion that this is a perfect idea. If you want to build this, there are a couple critical issues you need to be considering from day one.
The ability to create more niche search engines means that you will inevitably end up with partisan filtered search engines, which will exacerbate echo chambers. This isn't about one side or the other, it's an inevitable outcome. The question is how you will deal with these from a policy and governance perspective.
A large share of global web traffic happens on mobile. This trend is accelerating. Mobile devices are not suited to act as cache or relay nodes in a network like this. This will particularly impact poorer developing nations, where mobile devices are the primary method of digital interaction. If you look at the history of P2P tech in these locations, it has inevitably failed and fallen back to a centralized model. I expect the same thing will happen here when it comes to storing the data, indexing it, and executing search queries. However, the mobile devices can still act as crawlers.
There are many security, privacy, and other related topics which are not covered at all here. I highly recommend reading the research done by the Tribler team.
Brave Search is actually similar to this idea via the Web Discovery Project (WDP); see the "Is the Web Discovery Project a crawler?" section.
There are a number of personal archiving tools, such as Monolith, SingleFile, Diskernet (fka 22120), and Memex, and I am sure many others, some of which you can find at https://github.com/iipc/awesome-web-archiving