Google Takes on Truth with a Capital “T”

Massive databases of “knowledge” are popping up all over the web, and they could change how we conceive of truth and information itself.

These databases are related to the semantic web and have already changed the way we search and the kinds of results that are returned. They use information scraped from all corners of the Internet and all kinds of files, gathered by artificial intelligence bots (also known as robots or spiders, software programs that follow links around the web and catalog what they find into a database) and ever-changing algorithms. Machine learning is then used to analyze the relationships among all the pieces of data.
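
To make that idea a little more concrete, here is a minimal, hypothetical sketch of a bot that follows links and records candidate facts for later analysis. It is not any particular engine’s pipeline: the seed URL, the toy regex-based extract_facts helper and the in-memory tuple store are all assumptions made purely for illustration.

```python
# Toy crawler: follows links and stores candidate "facts" as
# (subject, predicate, object, source_url) tuples. Real systems use far more
# sophisticated extraction, scheduling and storage; this is only a sketch.
import re
import urllib.request
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects absolute href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

def extract_facts(url, html):
    """Toy stand-in for information extraction: treat simple
    'X is a Y' sentences as (X, 'is_a', Y) candidate facts."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    facts = []
    for subj, obj in re.findall(r"([A-Z][\w ]{2,40}) is a ([\w ]{2,40})[.,]", text):
        facts.append((subj.strip(), "is_a", obj.strip(), url))
    return facts

def crawl(seed, max_pages=10):
    seen, queue, facts = set(), deque([seed]), []
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to load
        facts.extend(extract_facts(url, html))
        parser = LinkParser()
        parser.feed(html)
        queue.extend(parser.links)
    return facts

if __name__ == "__main__":
    for fact in crawl("https://example.com"):
        print(fact)
```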

One of the most exciting promises of these knowledge databases is that search engines such as Google and Bing will be able to use them to rank websites based on “endogenous” signals such as truth rather than just “exogenous” factors such as inbound links (hyperlinks from outside a root domain that point to a page within that domain), to use the language of search giant Google.

I see this as a largely positive change for truth seekers everywhere, but it raises the age-old questions of what truth is, how much consensus has to do with truth, and whether we should be OK with bots making these decisions for us.

Knowledge Vault from Google

Google’s Knowledge Vault is probably the most talked-about repository in the industry. According to a research report about the project, the vault contains about 1.6 billion “facts.” Of those, about 271 million are estimated to have at least a 90 percent probability of being true.

The report explained how Google could use this vault and other factors to measure “knowledge-based trust,” or KBT. A site’s KBT score would affect its position in search results.

Here’s an excerpt:

“We propose using Knowledge-Based Trust (KBT) to estimate source trustworthiness as follows. We extract a plurality of facts from many pages using information extraction techniques. We then jointly estimate the correctness of these facts and the accuracy of the sources using inference in a probabilistic model. Inference is an iterative process, since we believe a source is accurate if its facts are correct, and we believe the facts are correct if they are extracted from an accurate source. We leverage the redundancy of information on the web to break the symmetry. Furthermore, we show how to initialize our estimate of the accuracy of sources based on authoritative information, in order to ensure that this iterative process converges to a good solution.”

As the researchers explain, much of the “true” data is determined to be so because it can be found on the web in many places and the source is “accurate.” They’re probably right most of the time. As a news reporter, I learned that one of the best ways to approximate truth is to get as many sides of the story as possible. However, Google’s plan for Knowledge Vault brings up the question of “wikiality,” where truth is determined by consensus rather than by reality and logic.
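
The iterative idea in that excerpt can be caricatured with a short, generic truth-discovery loop: believe a value more if accurate sources assert it, and rate a source as more accurate if its values are believed. This is emphatically not Google’s probabilistic model; the sample claims, the three made-up sources and the 0.8 starting accuracy are all invented for illustration.

```python
# Simplified truth-discovery loop: alternate between scoring candidate values
# by the accuracy of the sources asserting them, and scoring sources by the
# belief in the values they asserted. A caricature of KBT-style inference.

# Each (subject, attribute) pair maps to the value each source asserts for it.
claims = {
    ("Eiffel Tower", "city"): {"site_a": "Paris", "site_b": "Paris", "site_c": "London"},
    ("Mount Everest", "height_m"): {"site_a": "8849", "site_c": "8000"},
}

sources = {"site_a", "site_b", "site_c"}
accuracy = {s: 0.8 for s in sources}  # arbitrary prior accuracy for every source

for _ in range(20):  # iterate until the estimates settle
    # 1. Belief in each candidate value = normalized sum of its supporters' accuracy.
    value_beliefs = {}
    for key, assertions in claims.items():
        scores = {}
        for source, value in assertions.items():
            scores[value] = scores.get(value, 0.0) + accuracy[source]
        total = sum(scores.values())
        value_beliefs[key] = {v: s / total for v, s in scores.items()}

    # 2. Source accuracy = average belief in the values that source asserted.
    for source in sources:
        beliefs = [value_beliefs[key][value]
                   for key, assertions in claims.items()
                   for s, value in assertions.items() if s == source]
        if beliefs:
            accuracy[source] = sum(beliefs) / len(beliefs)

print("Estimated source accuracy:", accuracy)
for key, beliefs in value_beliefs.items():
    best = max(beliefs, key=beliefs.get)
    print(key, "->", best, f"(belief {beliefs[best]:.2f})")
```

Run on this toy data, the loop rewards the two sites that agree and marks down the one that contradicts them, which is exactly the “wikiality” worry: agreement, not reality, is what the math can see.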

It is important to note here that, in contrast to some media reports, Google says the Vault is not part of Knowledge Graph, which started showing up in search results in 2012. The Graph has 750 million objects (things) and 18 billion facts (statements about and connections between those things). These facts show up in boxes on search results pages for queries about everything from health conditions to nutrition, celebrities, historical facts and biology. Google’s Knowledge Graph was originally fed by the open-source repository Freebase, as well as sources such as the CIA World Factbook. Google has announced it will shut down Freebase entirely by the end of June and rely on Wikidata instead.


While Knowledge Vault is still just a research project, we’re confident that Google will incorporate it into search results in the near future. That means we can all be more confident that the search results we get are correct, but even then we should remember to dig deep. It’s a bit counterintuitive, but if KBT scores are factored into rankings, visiting the tenth page of SERPs might finally be worth it.
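
To picture what folding KBT into rankings could even look like, here is a purely hypothetical blend of an exogenous link score with an endogenous trust score. The pages, the scores and the 70/30 weighting are all made up; nothing here reflects how Google actually ranks results.

```python
# Hypothetical re-ranking: blend a link-based score (exogenous) with a
# knowledge-based trust score (endogenous). All values are invented.
pages = [
    {"url": "https://gossip.example/celebrity-story", "link_score": 0.92, "kbt": 0.35},
    {"url": "https://niche-journal.example/study", "link_score": 0.40, "kbt": 0.97},
    {"url": "https://big-portal.example/overview", "link_score": 0.75, "kbt": 0.80},
]

LINK_WEIGHT, KBT_WEIGHT = 0.7, 0.3  # arbitrary blend chosen for illustration

def blended_score(page):
    return LINK_WEIGHT * page["link_score"] + KBT_WEIGHT * page["kbt"]

for page in sorted(pages, key=blended_score, reverse=True):
    print(f"{blended_score(page):.2f}  {page['url']}")
```

Even with this arbitrary weighting, the heavily linked but low-trust page loses its top spot, which is the kind of shift the researchers describe.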

Microsoft’s Historic Take On Amassing Knowledge

Bing’s knowledge repository technology is called Satori, which has used Freebase and other open-source sites in addition to data scraped from elsewhere. Microsoft has used Satori to “build their repositories of semantically rich encoded information that can help answer questions in milliseconds, rather than deliver a page of links,” according to an article posted on CNET.com. These results come in the form of Bing Snapshots, which look a lot like Google’s Knowledge Graph boxes.

“At Bing we believe that search should be more than a collection of blue links pointing to pages around the web,” wrote Dr. Richard Qian of the Bing Index Team in a 2013 blog post introducing Snapshot.

Bing, which is now the default search engine on Apple devices, is also being used as part of the recently announced improvements to the Spotlight search tool in iOS 8 and OS X El Capitan for Mac.

A New Database Player – Diffbot

Just last week, another big database player rose to prominence when it received $500,000 in capital from the venture fund Bloomberg Beta, in addition to other private financing. Diffbot, a California-based semantic search company, crawls the web for information and organizes that data into objects, which are built into a structured data set that can be programmatically reused by query engines. Diffbot’s business has been to sell CRM-related lead generation APIs to third-party companies, including heavyweights such as Cisco, Samsung, StumbleUpon, eBay, and CBS Interactive. Word on the street, however, is that its data may already have been, or soon will be, incorporated into search engines.
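
For a sense of what programmatic reuse of that structured data looks like in practice, here is a hypothetical sketch of asking a Diffbot-style service to turn an article URL into machine-readable objects. The endpoint path, the token and url parameters and the shape of the JSON response are assumptions made for illustration; the company’s actual documentation is the authority on its API.

```python
# Hypothetical example of querying a structured-data extraction service.
# Endpoint, parameter names and response fields are assumed for illustration.
import json
import urllib.parse
import urllib.request

API_TOKEN = "YOUR_TOKEN_HERE"  # placeholder credential
ENDPOINT = "https://api.diffbot.com/v3/article"  # assumed endpoint path

def fetch_article_objects(page_url):
    """Ask the service to extract structured objects from a single page."""
    query = urllib.parse.urlencode({"token": API_TOKEN, "url": page_url})
    with urllib.request.urlopen(f"{ENDPOINT}?{query}", timeout=10) as resp:
        payload = json.load(resp)
    return payload.get("objects", [])  # assume extracted objects come back in a list

if __name__ == "__main__":
    for obj in fetch_article_objects("https://example.com/some-article"):
        print(obj.get("type"), "-", obj.get("title"))
```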

Diffbot’s Global Index contains more than 600 million objects and 19 billion facts, according to an analysis article posted on Forbes.com by Anthony Wing Kosner. That data, which Kosner says exceeds what Google has catalogued, is heavily driven by machine learning and artificial intelligence.

It’s interesting to note that Diffbot was showcased at one of Bing’s LAUNCH events in 2012.

Data, Truth and Trustworthiness

Knowledge Graph and Bing Snapshots have already come under criticism for decreasing traffic to other sites, particularly Wikipedia. The concerns are that the people who originally posted the ideas aren’t being given credit for them and now have no way to monetize their efforts. There have also been concerns about the accuracy of the knowledge posted in those boxes, but the search engines have introduced a few ways to challenge that data if you find errors. Google’s philosophy seems to be that if its searchers are really interested in a topic, they will look beyond the box to find more information.

A Look Ahead for Businesses, Marketing and SEO

These bot-driven databases are sure to change the world of search and, thereby, search engine optimization and digital marketing. The most important thing to do is make sure all the information on your site is up to date and accurate, including everything from blog posts to spellings to dates and the names of places. We also recommend getting ahead of the curve by creating entries with accurate data about your site on free sites such as Wikimedia-owned properties. Rich snippets will also continue to be important, we suppose, and don’t forget about exogenous factors such as inbound links: they will still be counted no matter what.

As for truthiness and wikiality, it seems that all we can do for now is examine and discuss our philosophies. In the future, let’s keep the issue in mind, for the good of us all. The promise of credible information at our fingertips is thrilling, but the risk of inaccuracy can’t be overlooked.