Rank Algorithm - an overview (2023)

Related terms:

  • Social Network
  • Compressive Sensing
  • Nuclear Norm
  • Component Analysis
  • Minimization Problem
  • Research Worker
  • Singular Value
View all TopicsNavigate Right

News Search Ranking

Bo Long, Yi Chang, in Relevance Ranking for Vertical Search Engines, 2014

2.1.2 Combine Relevance and Freshness

Learning-to-rank algorithms have shown significant and consistent success in various applications [226,184,406,54]22618440654. Such machine-learned ranking algorithms learn a ranking mechanism by optimizing particular loss functions based on editorial annotations. An important assumption in those learning methods is that document relevance for a given query is generally stationary over time, so that, as long as the coverage of the labeled data is broad enough, the learned ranking functions would generalize well to future unseen data. Such an assumption is often true in Web searches, but it is less likely to hold in news searches because of the dynamic nature of news events and the lack of timely annotations.

A typical procedure is as follows:

Collect query-URL pairs.

Ask editors to label the query-URL pairs with relevance grades.

Apply a learning-to-rank algorithm to the train ranking model.

Traditionally, in learning-to-rank, editors label query-URL pairs with relevance grades, which usually have four or five values, including perfect, excellent, good, fair, or bad. Editorial labeling information is used for ranking model training and ranking model evaluation. For training, these relevance grades are directly mapped to numeric values as learning targets.

For evaluation, we desire an evaluation metric that supports graded judgments and penalizes errors near the beginning of the ranked list. In this work, we useDCG [175],


where i is the position in the document list, and Gi is the function of relevance grade. Because the range of DCG values is not consistent across queries, we adopt the NDCG as our primary ranking metric,


where Zn is a normalization factor, which is used to make the NDCG of the ideal list be 1. We can use NDCG1 and NDCG5 to evaluate the ranking results.

We extend the learning-to-rank algorithm in news searches, for which we mainly make two modifications due to the dynamic nature of the news search: (1) training sample collection and (2) editorial labeling guideline. Training Sample Collection

The training sample collection has to be near real time for news searches by the following steps:


Sample the latest queries from the news search query log.


Immediately get the candidate URLs for the sampled queries.


Immediately ask editors to do judgments on the query-URL pairs with relevance and freshness grades.

We can see that all the steps need to be accomplished in a short period. Therefore, the training sample collection has to be well planned in advance; otherwise, any delay during this procedure would affect the reliability of the collected data. If queries are sampled from an outdated query log or if all of the selected candidate URLs are outdated, they cannot represent the real data distribution. If editors do not label query-URL pairs on time, it will be difficult for them to provide accurate judgments, because editors’ judgments rely on their good understanding of the related news events, which becomes more difficult as time elapses. Editorial Labeling

In a news search, editors should provide query-URL grades on both traditional relevance and freshness. Although document age is usually available in news searches, it is impossible to determine a document’s freshness based solely on document age. A news article published one day ago could either be very fresh or very old, depending on the nature of the related news event. So, we ask editors to provide subjective judgments on document freshness, which can have a few different grades, such as the following:

Very fresh: latest documents (promote grade: 1)

Fresh (promote grade: 0)

A little bit outdated (demote grade: −1)

Totally outdated and useless (demote grade: −2)

For ranking model training, we combine relevance grade and recency grade for a new grade as learning target. An example of such a grade combination is shown in the list. If the document is very fresh, we promote one grade from its original relevance grade. For example, if a document has a good grade on relevance and a fresh grade on freshness, its final grade is excellent, which is one grade higher than it is relevance grade, good. If the document is fresh, we neither promote nor demote. If the document is a little bit outdated, we demote one grade from its original relevance grade. If the document is totally outdated and useless, we demote two grades.

For ranking model evaluation, we can compute DCG values based on either the combined grades or the relevance grade and freshness grade separately.

To evaluate freshness in isolation, we also include a freshness metric based on DCG,called DCF:


where i is the position in the document list, and Fi is the freshness label (1 or 0). A query may have multiple very fresh documents—for example, when multiple news sources simultaneously publish updates to some ongoing news story. Note that DCF is a recency measurement that is independent of overall relevance. Therefore, when we evaluate a ranking, we should first consider demoted NDCG, which represents the overall relevance, then inspect the value of the DCF. We also define normalized discounted cumulative freshness (NDCF) in the same way as for NDCG in Eq. 2.3.

View chapterPurchase book

Read full chapter



Fact-finding in information networks

Dong Wang, ... Lance Kaplan, in Social Sensing, 2015


This chapter reviews the state-of-the-art fact-finding schemes developed for trust and credibility analysis in information networks with a focus on a Bayesian interpretation of the basic fact-finding scheme. The fact-finding algorithms rank a list of claims and a list of sources by their credibility, which can be used toward solving the reliable social sensing problem. In particular, we review a Bayesian interpretation of the basic mechanism used in fact-finding from information networks in detail. This interpretation leads to a direct quantification of the accuracy of conclusions obtained from information network analysis. Such quantification scheme is of great value in deriving reliable information from unreliable sources in the context of social sensing.

View chapterPurchase book

Read full chapter



Case Studies

Israel Koren, C. Mani Krishna, in Fault-Tolerant Systems (Second Edition), 2021

8.10.3 Fault Tolerance as a Service

Just as computational cycles are bought as a service from the provider, fault tolerance can be purchased as well. The user specifies certain capabilities that need to be replicated for fault tolerance. The service provider then matches available resources to such incoming demands from the user. The provider must also continually monitor the functioning of these resources so that the appropriate service level is maintained. Sometimes, this requires the remapping (reassignment) of resources to users.

The user may decide that it is too expensive to replicate everything related to their application. Instead, replication may be purchased for just the most critical components. The question is how to determine which components are the most critical. One suggestion is to use an approach that draws its inspiration from the well-known page rank algorithm used to identify important web pages. The idea is that a component which is invoked often is likely to be more important. A directed graph is created, with individual components being the nodes and directed edges denoting an invocation relationship, that is, if component a invokes a service on component b, then there is an edge ea,b connecting the nodes na, nb in the system graph. Let ϕa,b be the frequency with which component a invokes a service on component b. Then, the weight of the edge ea,b is calculated as wa,b=ϕa,bjϕa,j. If a node invokes no other component, we create an edge from this node to every node in the system (including itself); the weight of each such edge is then set to 1/m, where m is the number of nodes in the system. Note that the weights of the edges leaving a node always sum to 1.

Let N(a) denote all the nodes which have a link from node na. Define a parameter d[0,1], whose meaning will be described in a moment. Now, define the following equation of the significance level of component a:


To obtain a solution to this recursion, we start with random values for the U(k), and then iteratively apply the recursion until it converges. The values U(a) can then be treated as the relative importance of component a.

Let us now turn to the value of d and the role it plays. Note that if d=1, then U(a)=1/n for each component a, and there is no need for a recursion; each component is equally important. If d=0, then the importance of component a stems entirely from its relationship to the other components.

Secondly, note that the system of Eqs. (8.1) are dependent. To obtain absolute values for them, we will need to add a boundary condition: aU(a)=A, where A is some suitable value (often taken as 1).

Note that apart from the invocation frequency of one component by another, we have not included any domain knowledge in our model. This is not difficult to include: see the Further Reading section for a pointer.

The user can then assign redundancy to components in line with their significance.

View chapterPurchase book

Read full chapter



Inviting to Participation

Stefan Bente, ... Shailendra Langade, in Collaborative Enterprise Architecture, 2012

A primer on Enterprise 2.0

The term Enterprise 2.0 was coined by Andrew McAfee, professor at the MIT Sloan School of Management at Harvard. In 2006, McAfee published a paper, Enterprise 2.0: The Dawn of Emergent Collaboration (McAfee, 2006a), in which he described social software as a means of collaboration in an enterprise context. The paper gained a great deal of attention. Social software had always been regarded as a playground for leisure time or private activities; that it could also add value to the grave, professional world of modern enterprises was a new idea.

McAfee took up two trains of thought that were widely discussed and applied them to his research topic: the “impact of IT on how businesses perform and how they compete” (McAfee, 2009, p. 3). One train of thought is Web 2.0. The other is a series of related research about collective intelligence—the way communities collaborate, coordinate their activities, manage knowledge, and come to new insights.

Let's first take a look at Web 2.0.2 There is no straightforward definition of this term. An authority like Sir Timothy Berners-Lee, inventor of HTML, professor at the Massachusetts Institute of Technology (MIT), and director of the World Wide Web Consortium (W3C) even regarded it as “[…] a piece of jargon, nobody even knows what it means” and claimed that it actually adds nothing that wasn't there in the good old Web 1.0 (Berners-Lee, 2006). The “tag cloud” of notions people associate with Web 2.0 is indeed diverse and full of concepts lacking a reliable definition.

The most important description of Web 2.0 goes back to Tim O'Reilly, founder and CEO of O'Reilly Media, in 2005. In his much-quoted article “What Is Web 2.0?” he defends the term against the criticism of fuzziness: “Like many important concepts, Web 2.0 doesn't have a hard boundary, but rather, a gravitational core” (O'Reilly, 2005). In other words, just as there is no hard boundary between jazz and pure noise, there's no hard boundary between things that are Web 2.0-like and things that are not—but that doesn't render the concept meaningless.

Rank Algorithm - an overview (1)

Figure 8-1. Web 2.0 tag cloud.

Web 2.0 is a conglomerate of technologies, software principles, business models, and, most important, a different user paradigm of the Internet as a social medium. Web things scoring 100% in all these aspects clearly are at the gravitational core of Web 2.0, as shown in Figure 8-2.

Rank Algorithm - an overview (2)

Figure 8-2. Aspects and the gravitational core of Web 2.0.

Take, for instance, a Web application built on specific technologies according to specific software principles, distributed according to new business models, and addressing the user as a social being; this application would beyond doubt deserve the label “Web 2.0.” On the other side, if we stepwise subtract these properties, we more and more move toward the boundary until it eventually becomes dubious whether it still is Web 2.0 or not.

We do not want to go deeper into the characteristic technologies, since there is a broad literature available on this topic. Nevertheless, to indicate what we mean by Web 2.0 technologies, here's a short and certainly not exhaustive list:

Feeds (RSS or Atom), permalinks, and trackbacks

Web-services (REST or SOAP WS-*)

Server-side component frameworks such as Java Server Faces or Microsoft ASP.NET

Full-stack-frameworks for rapid development such as Ruby on Rails or Grails

Client-side technologies such as AJAX, JavaScript, Adobe Flex, or Microsoft Silverlight

Content and application integration technologies, namely portals and in particular mashups such as IBM Mashup Center or Software AG MashZone

These technologies are the enablers for the new Web 2.0 game. Feeds, permalinks, and trackbacks, to explore some examples, are the foundation of the blogosphere, the social ecosystem of cross-connected Web blogs.

The software principles outlined here as a second characteristic of Web 2.0 guide the way software is designed and developed on the new Web. The most eye-catching principle is the design for a richer user experience, or, in more glowing words, for joy of use. It is the quality leap from the stubborn, form-based request-response behavior of classical Web applications to the handsome comfort we've gotten used to with desktop applications. Modern Web applications offer fluid navigation, context-aware menus and help, visual effects and multimedia, and a look that doesn't feel like something out of a bureaucrat's cabinet.

Another principle is the so-called perpetual beta. Contrary to traditional software products that are released as ready-made, compact feature packs in relatively long-lasting release cycles, Web 2.0 software is typically in a state of continuous improvement. It is released frequently in short cycles,3 and upgrades happen seamlessly, without the blare of trumpets that announce traditional software releases.

The perpetual beta also invites users to be development partners. It is a trial-and-error approach to guessing what the user wants. Features are tentatively placed on a Website, but if users do not embrace them, the provider removes them. Both the “release soon and often” factor and the unresisting adaptation to user behavior also characterize lean and agile development approaches, as we know from Chapter 7. These approaches therefore are natural complements to Web 2.0.

The last principle we want to outline here is the design for remixability. This means that data and functionality are delivered to the client in the form of extensible services, as opposed to the closure of functions traditional applications offer. The services are intended to be remixed by the service consumer, even in a way that the provider has not got the faintest idea about. Their design therefore strives for lightweight APIs, versioning, and other ingredients of loose coupling.

Remixability brings us to the business models characterizing Web 2.0. The business models of Web 2.0 are no longer centered on competition between applications. Internet users today simply expect reasonable user interfaces, but it is nothing they get excited about. What separates the wheat from the chaff are the services, in particular the data offered by a competitor. Companies that provide highly valuable data in a remixable way and that demonstrate operational excellence in ensuring performance, security, and other nonfunctional properties are gaining market share.

Another feature of Web 2.0 business models is that they serve the long tail. The long tail (see Figure 8-3)4 is a metaphor the journalist Chris Anderson (2006) applied to the large market segment of diversified low-volume offerings.

Rank Algorithm - an overview (3)

Figure 8-3. Long tail of product distribution.

Internet marketplaces such as Amazon demonstrate strong support of the long tail. Bookworms find rare bibliophilic editions there because Amazon remixes the services offered by smaller or specialized booksellers and creates a seamless link between seller and bookworm. This is a nice example of how the openness and remixability of Web 2.0 services enable providers to meet individualized, uncommon requirements.

A third feature of Web 2.0 business models is that they turn the formerly passive role of the consumer into active participation. Consumers act, for example, as marketing agents: They spread the news and—consciously or not—recommend products. Marketing folks use the (slightly insane) term viral marketing for the methodology to influence this genuinely autonomous consumer activity.

Tim O'Reilly finds this kind of marketing so characteristic that he writes: “You can almost make the case that if a site or product relies on advertising to get the word out, it isn't Web 2.0” (O'Reilly, 2005). Consumers have an important say in the public appearance of a product. They evaluate products, write critiques and user guides, publish hints and tricks, relate products to each other, and so forth. In contrast to pre-Web 2.0 times, it is now much more difficult for a company to control the public image of their offerings.

The last feature of Web 2.0 business models we'd like to mention here is that services get better the more people use them. Examples of this characteristic are public spam filters or file distribution models such as BitTorrent. Such services functionally depend on being used, and the more users there are, the better the services work.

The features of business models already indicate what can be regarded as a litmus test for Web 2.0. It is the question whether the Internet is utilized as just a rather one-directional publication channel or as a platform that addresses users as social beings and invites them to participation and exchange. Figure 8-4 nicely illustrates this distinction in paradigms.

Rank Algorithm - an overview (4)

Figure 8-4. Information source versus platform of participation.

The social momentum of Web 2.0 is particularly visible in social software, which is an important subcategory of Web 2.0 applications. The most prominent examples of social software such as Wikipedia, Facebook, or Twitter are ubiquitous and known to everyone. Tom Coates defines social software as “software which supports, extends, or derives added value from human social behavior” (Coates, 2005).

Though the definition is crisp and nicely emphasizes the value lying in IT-enabled social behavior, it is slightly too general. Email, for example, would fall under Coates's definition, but email is generally not regarded as social software. Social software offers a platform, a space where actions are publicly visible and persistent. A platform invites an indeterminate cloud of people to react, whether sooner or later; that's up to them. Email, on the contrary, opens a channel visible to a small group only and imposes an obligation to reply soon. Thus, email is ruled out of the social software equation.

But what about company portals—are they social software? According to McAfee (2009), this depends on how much structure they impose. Most portals we have seen are intended to support a tightly coupled working group in predefined tasks. They show a rich and somewhat rigid structure in terms of workflows, data and document types that can be published, permissions, and roles. This is not social software. Social software minimizes the pre-given structures and leaves the rest to self-organization. This doesn't mean it is unstructured, but the pre-given landmarks are reduced to a few condensation-points where content can accumulate.

McAfee (2009) uses the term emergent to characterize structure that is not preplanned by a mastermind but emerges out of local activities in a community. The structure is, as McAfee puts it, irreducible: It results from network effects and economies of scale but cannot be deduced from the contributions of individual actors.

The tag clouds of the Yahoo! bookmarking service Delicious are an example of an emergent structure. Internet users make their bookmarks publicly available and assign tags to the bookmarks that, in their very subjective view, describe the contents of the bookmarked page. The aggregation of individual tags yields a tag cloud that puts Web content into categories with amazing accuracy and facilitates catchword-based searching. Such categorizations are called folksonomies—taxonomies created by a large crowd of folks.

A common way to systematize the manifold social software applications is by means of a social software triangle, as shown in Figure 8-5. This triangle arranges application types according to their contribution to the three fundamental use cases of social software:

Rank Algorithm - an overview (5)

Figure 8-5. The social software triangle.

Information management. Creating, aggregating, finding, evaluating, and maintaining information.

Identity and network management. Shaping and presenting their own personal profiles, building and cultivating social ties.

Interaction and communication. Direct and indirect communication, conversations, correspondence, information exchange, quotations, and comments.

Another well-established characterization of an information platform as social software is formulated by McAfee (2009, p. 3) and condensed into the acronym SLATES, which stands for:

Search. Users must be able to find what they are looking for.

Links. Users must be able to link content. Links structure the content in an emergent way and rate the relevance of some content by counting the links referring to it (which is Google's page-rank algorithm).

Authoring. Users must be able to contribute their knowledge, experience, insights, facts, opinions, and so on.

Tags. Users must be able to apply tags to categorize content in an emergent way.

Extensions. Users get automatic hints on content that might be interesting to them based on user behavior and preferences (“If you like this, you might like …”)

Signals. Users can subscribe to alerts or feeds so that they are continuously informed about new or updated contents. By using feed aggregators, they also can choose to automatically incorporate the most recent content into their own content.

But let's conclude here. We hope that the reader by now has a good grasp of what Web 2.0 is about, given that this notion cannot be defined with ultimate sharpness. At this point, we also have collected all ingredients to state McAfee's famous definition of Enterprise 2.0. It reads (McAfee, 2006b):

Enterprise 2.0 is the use of emergent social software platforms within companies, or between companies and their partners or customers.

Today, many organizations in which knowledge work is an essential part of the business have adopted social software in some way. Wikis, blogs, and Facebook-like social networks can be found in many companies or institutions. The momentum such tools have is not caused by technology but rather by the social dynamics they elicit. They fill what the sociologist Ronald Burt denotes as structural holes in a knowledge network (Burt, 1995)—missing links between people that otherwise would provide some information benefits.

The observation goes back to Mark Granovetter, who in 1973 published his influential paper The Strength of Weak Ties (Granovetter, 1973). Prior to this landmark paper, studies of social dynamics always emphasized the importance of the closely related group. This is a group of peers, friends, relatives, colleagues, and other people who have strong ties.

But Granovetter argued that for gathering information, solving problems, and inspiration from unfamiliar ideas, the weak ties are at least as important. Strongly tied groups are in a sense islands of static knowledge. The members more or less share the same experiences, convictions, and tenets. Even if there are different opinions or backgrounds, exchange must usually come to a “we agree to disagree” arrangement. Weak ties, on the other hand—relationships to people you know only superficially or by name—are bridges to other knowledge islands and can therefore be strong in acquiring new insights.

Andrew McAfee adopted Granovetter's idea and extended it based on the potential of the new Web 2.0 technologies to what he calls the “bull's eye” (Figure 8-6).5

Rank Algorithm - an overview (6)

Figure 8-6. The bull's eye.

At the center of the bull's eye are strongly tied people. In the working world, these are the people you work with on a regular basis—people who belong to your community of practice. This is the inner circle that has been addressed by traditional IT tools for collaboration by so-called groupware. Web 2.0 does not revolutionize this circle except that it provides lighter accessibility and a less rigorous structure.

The social net grows to the second circle of the bull's eye if you take in people you worked with on a project long time ago, people you just met once on a conference, or people who happen to be in your address book for some other reason— in short, people to whom you have weak ties. These are the ties that become important when you want to know something off the beaten track of your daily routine. They are the people you want to contact when all your buddies are shrugging their shoulders.

Maybe your concern is to introduce an orchestration engine into your ESB6 and use BPEL7 as a means to implement business processes, but nobody in your company has any experience with the pros and cons of doing such a task. Today this might be the point at which you turn to professional networks such as LinkedIn or even Facebook. These tools not only provide an address book and lightweight connectivity; they also make you aware of what your weakly tied counterparts are doing and knowing—they provide what Enterprise 2.0 apologists call network awareness.

But the potential of Web 2.0 goes beyond people's weak ties. The characteristic trait that people contribute content to platforms on a broad basis spawns potential ties between them. If you find something relevant in the blog of some author or a comment or backtrack in your own blog left by some person, it can be the starting point of a conversation or even a collaboration.

The outermost circle labeled “None” on the bull's eye is not there in all versions of this diagram. It denotes the kind of social dynamic that is not centered on ties but rather on an anonymous aggregation of information contributed by people who have no intention of getting in contact with each other. Examples of such dynamics are tagging, rating, and prediction markets.

It is a remarkable feature of the bull's eye that it has backtracks from the outer circles to the inner ones. Since the communities of practice at the inner circle also place their content on public platforms, they become more open to comments or opinions from the outer space, which is one of the major advantages of Web 2.0 with regard to the inner circle.

The public space offers convincing examples of how the means of Web 2.0, which look inconsiderable at first sight, can change the rules of the game. Take, for instance, the way we consume products today—aided by ratings, consumer forums, links to similar products, and so forth. Don't we have reason to suspect that there is some potential to Enterprise 2.0that it implies a major shift in the way people are doing knowledge-intensive work? This is the reason we analyze it here for its applicability to EA.

The introduction of Enterprise 2.0 into an organization is, on the other hand, not a straightforward thing. It is in no case sufficient to install a wiki, some social networking tool, a blogging platform, or other Web 2.0 software, then leave the rest to emergence. The hope that tools alone will make knowledge and communication flourish like a tropical rainforest is illusory. The introduction is to a large extent a question of organizational measures, and we list here just the ones we find most critical:

Enterprise 2.0 platforms must be put into “the flow.” They should not come as an additional burden apart from the daily work processes. This might imply a clear-cut replacement of existing tools.

Enterprise 2.0 platforms must be governed by carefully defined rules and norms. The authoring rules of Wikipedia are a good example of how this guides the direction, contents, and etiquette of such a system.

Enterprise 2.0 platforms should come with a skeleton of predefined structure. This structure provides crystallization points at which contributions of the user community can sediment. A tabula rasa leaves too many contextual considerations to users and is perceived as an obstacle for starting to use the platform.

Contributions to Enterprise 2.0 platforms must be valued. There's a wide range of options here, from somewhat childish goodies8 to incentive-related goal agreements. But the most convincing signs of appreciation are that (a) that authorities or highly respected persons start using the platform and (b) important content is published and discussed there. If the CEO writes a blog but the staff finds nothing more important there than her impressions of the New York marathon, the platform is not earnestly valued.

A thorough discussion of the weaknesses and threats of Enterprise 2.0 is beyond the scope of this book, so we refer the interested reader to Chapter 6 of McAfee (2009) for such a discussion. There is in fact a broad variety of pernicious things that can show up in a social platform—for example:

Inappropriate content. Pictures of doggies and kittens, conversations about cycling training methods, off-topic discussions, and the like.

Flame wars. Discussions for the sake of discussion, insults, airing of dirty laundry, mobbing, troll behavior, etc.

Wrong content. Self-proclaimed experts giving wrong advice, employees uttering opinions that are contradictory to the company position, etc.

We tend to agree with Andrew McAfee that these downsides are, as he puts it, “red herrings”—threats that can be overcome by suitable norms, corrective actions of authorities, or role models. The self-healing capabilities of social platforms may also auto-correct them to some extent.

But the loss of confidentiality and the danger of security leaks are, in our view, more serious concerns than McAfee thinks. Envision a company that has outsourced a large part of its IT function to service providers, business partners, or freelancers. This typically results in a politically delicate conglomerate of conflicting interests. Making the IT strategy, the map of the IT landscape, product decisions, or other highly interesting topics public to all involved parties changes the game between service provider and buyer.

From the buyer's perspective it gets more difficult to pull the wool over the provider's eyes, and from the provider's point of view it is simpler exploit weak spots as cash cows. One might ask in such a constellation whether it is the new openness or the game of “partners” itself that smells fishy, but such shifts in the political equilibrium caused by Enterprise 2.0 must be given due consideration.

The most serious threat about Enterprise 2.0 certainly is that it simply doesn't work due to a lack of participation. By April 2011, there were 3,649,867 English articles in Wikipedia,9 a tremendous number that outreaches other encyclopedias by magnitudes. At the same time, the number of contributors, who are the users with at least 10 edits since their arrival, counts up to 684,737.10 This is an impressive absolute number, but given that there are more than one billion English-speaking Wikipedia users, the number melts down to a tiny relative percentage of about 0.068% of the whole.

If the same percentage applied to an Enterprise 2.0 platform featuring topics regarding the IT function of an enterprise, a theme that does not reach more than 1,000 addressees, the whole idea would collapse to a homepage and publication channel of a handful of expert authors. Thus, countermeasures must be taken to raise the percentage of contributors to a viable level. Some of these measures have been outlined previously—for instance, the advice to put Enterprise 2.0 tools into the flow—but we must have an eye on this aspect in the following sections, where we ponder Enterprise 2.0 as an enabler for EA.

View chapterPurchase book

Read full chapter



Social network analysis: Measuring, mapping, and modeling collections of connections

Derek L. Hansen, ... Itai Himelboim, in Analyzing Social Media Networks with NodeXL (Second Edition), 2020

3.6 Social networks in the era of abundant computation

The widespread adoption of networked communication technologies has significantly expanded the population of people who are both aware of network concepts and interested in network data. Although the idea of networks of connections of people spanning societies and nations was once esoteric, today many people actively manage an explicit social network of friends, contacts, buddies, associates, and addresses that compose their family, social, professional, and civic lives. Facebook posts forwarded from person to person have become a common and visible example of the ways information passes through networks of connected people. The notion of “friends of friends” is now easy to illustrate in the features of social media applications like Facebook and LinkedIn that provide explicitly named “social networking” services. Viral videos and chain emails illustrate the way word of mouth has moved into computer-mediated communication channels. The idea of “six degrees of separation” has moved from the offices of Harvard sociologists to become the dramatic premise of a Broadway play to now appear as an expected feature of services that allow people to browse and connect to their friend’s friends.

As network concepts have entered everyday life, the previously less visible ties and connections that have always woven people together into relationships, cliques, clusters, groups, teams, partnerships, clans, tribes, coalitions, companies, institutions, organizations, nations, and populations have become more apparent. Patterns of information sharing, investment, personal time and attention have always generated network structures, but only recently have these linkages been made plainly visible to a broad population. In the past few decades, the network approach to thinking about the world has expanded beyond the core population of researchers to a wide range of analysts and practitioners who have applied social network methods and perspectives to understand their businesses, communities, markets, and disciplines. Today, because many of us manage many aspects of our social relationships through a computer-networked social world, it is useful for many more people to develop a language and literacy in the ways networks can be described, analyzed, and visualized. Visualizing and analyzing a social network is an increasingly common personal or business interest. The science of networks is a growing topic of interest and attention, with a growing number of courses for graduates and undergraduates, as well as educational materials for a wider audience (e.g., television documentaries).7

The availability of cheaper computing resources and network datasets has enabled a new generation of researchers access to studies of the structures of social relationships at vastly larger scale and detail. Since the late 1960s, as computing resources and network datasets have grown in availability and dropped in cost, researchers began developing tools and concepts that enabled a wider and more sophisticated application of social network analysis.

Advanced topic

Social network analysis research meets the web

As access to electronic networks grew in the 1970s, academic and professional discussions and collaborations began to take place through them. Systems to support the exchange of messages and the growth of discussions and even decisions became a major focus of systems development and the focus of study itself. Freeman and Freeman [32] collected data from the records of the Electronic Information Exchange System (EIES) that itself hosted a discussion among social network researchers. Two relations were recorded: the number of messages sent and acquaintanceship. These systems became the focus of the first systematic research into naturally occurring social media. Even before the Internet, early computer network applications supported the creation of exchanges, discussions, and therefore social networks, built by reply connections among authors.

Early proprietary systems evolved into the public World Wide Web. In the 1990s, the computer scientist Jon Kleinberg created an algorithm called HITS that identified the patterns of links between high-quality web pages. This algorithm later inspired Stanford graduate students Sergey Brin and Larry Page who founded the Google corporation to develop a further refinement they called “Page Rank.” Kleinberg’s work described different locations within a population of linked documents on the World Wide Web: not all documents are equal. On the Web, a document or page can link to another page, forming a complex network of related documents. Some documents contain many pointers to other documents, whereas others have many documents that point at them. These “hubs” and “authorities” defined two broad classes of web pages that offered a path to identifying high-quality content. Links from one page to another are considered to be indicators of value. Refinements of the HITS algorithm made use of eigenvector centralities to implement the page rank algorithm that is the core of the Google “Page Rank” web search ranking method [33].

Network researchers studying social networks and the Internet found that empirical networks often exhibit “small-world properties”: most nodes are not neighbors with each other, but most nodes can be reached from almost every other node in a small number of hops. In the late 1990s, the physicist/sociologist Duncan Watts, working with the mathematician Steven Strogatz, created mathematical models of “small world” networks and contrasted them with purely random networks such as those proposed by Erdos and Renyi [34]. Their model captured the natural properties of social networks far better than those that assumed a purely random or normal distribution of links. Although most people have connections to other people who are local to them, people occasionally have a few connections that link them to another person physically far from the individual. Many of our friends are likely to live or work near us, but a few may be very far away. Even a modest number of these relatively rare far-reaching links can dramatically change the properties of a network, making the widespread transmission of messages much easier. This model significantly improved on earlier ways of thinking about network growth and structure, better approximating the observed structure of naturally occurring social networks. Later researchers have built upon their work to devise models that generate “small world” networks that more closely match empirical networks, helping us to understand how networks may have become the way they are. For example, Barabasi and Albert have developed a family of models of preferential attachment that can generate “scale-free” networks, which are a common feature of social networks [35]. Scale free networks have a power law degree distribution, meaning that there are a few key hubs in a network and many poorly connected vertices. While none of these models perfectly predict real world observed social networks, they provide a method for systematically comparing networks and focus attention on the processes that may have led to the characteristics that we do see in the networks around us.

In the past few years, researchers have begun to study large web-based networks. For example, Leskovec and Horvitz calculated metrics for a graph that includes more than 300 million users of the Microsoft Messenger service [36]. Each user typically had one or more “buddies” with whom he or she might send one or more messages and receive some in return. Buddies often listed their locations, allowing these linkages to be aggregated into a complex map of the world and the flow of conversation around it. Others have reported on the hyperlink network created by web pages hyperlinking to other web pages (e.g., Park and Thelwall [37]). A number of studies have examined the blog network. For example, Adamic and Adar [38] showed how political blogs are divided into two clear clusters with minimal overlap that represent the left and right political populations in the United States. More recently, Kelly and Etling mapped Iran’s blogosphere, identifying more than 20 subcommunities of bloggers who wrote in Farsi for an Iranian audience.8

Another line of research has focused on visualizing social networks. For example, an early influential paper by Heer and Boyd [39] described a tool called Vizster that allowed users to navigate through their friends from a social networking site to explore social connections. Now there are entire conferences dedicated to network visualization, such as the annual Graph Drawing and Network Visualization symposium.

As social media has matured and its ability to shape perceptions has been recognized, it has become the target of concerted efforts at manipulation. Misinformation, disinformation and propaganda have grown in visibility and concern. Claims and counter claims of “fake news” have become common. While initial critical analysis of social media like the work of Eli Parisier focused on its divisive ability to create “filter bubbles,” later work has explored the ways the bubbles can be penetrated and the divisions between them amplified.9 Computer scientist Kate Starbird shows the ways that national, political and commercial groups have collaborated to influence social media by creating messages aimed at making already divided groups more extreme.10 Fil Menczer and collaborators are building tools to identify “bots” and address the propagation of disputed or low value information.11 Information networks are designed to move all kinds of information, not just the true or high value kinds. Paradoxically people often resist abandoning beliefs despite strong evidence and often increase their commitments to beliefs that are challenged. Since a large amount of the information people want to create and consume is explicitly “fiction,” the goal of building machines that can highlight facts and diminish fake information is a challenge. The concept of “tribal epistemology” suggests that most facts exist only for certain people in certain places and times. If multiple truths can coexist it may be better to build maps of the many beliefs and their believers. Rather than seeking to identify the true among the field of untrue material, an alternative approach could map the range of claims and the people and groups who make them. Social media is often thought of as an example of a “marketplace of ideas”.12 If so, social media network analysis could be thought of as a form of accounting software for this marketplace. Without accounting and auditing most markets become rife with fraud and marketplaces for ideas are no exception. With better accounting it should be possible to clearly trace which groups are the source and support for various ideas, claims and beliefs. When independent scrutiny of markets is possible manipulation and collusion can be identified and potentially addressed.

View chapterPurchase book

Read full chapter



Content and link-structure perspective of ranking webpages: A review

Fayyaz Ali, Shah Khusro, in Computer Science Review, 2021

2.3.2 FRank algorithm

The Feature Rank (Frank) algorithm[119] ranks webpages by exploiting its properties in combination with the RankNet machine learning technique. These properties include the PageRank value, total number of visits, the anchor text, the page length (number of words in the page), density of important terms in the page, and features related to the domain of the webpage including e.g.,its PageRank score relative to other webpages, etc.[119]. The algorithm was reported to be efficient than the PageRank algorithm in finding the importance of webpages. The page-related features were found to be the most influential in ranking. Combining multiple features enables the algorithm to capture the preferences of both the webpage creators and users.

View article

Read full article



A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects

Absalom E. Ezugwu, ... Andronicus A. Akinyelu, in Engineering Applications of Artificial Intelligence, 2022

7.1 Web usage

The ever-growing electronic data size originating from web applications’ proliferation has motivated the exploration of hidden information from text content. Nirkhi and Hande (2008) presented a summary on the use of web page rank algorithms (such as Hyperlink Induced Topic Search, page rank) and web page-based clustering algorithms (such as Suffix Tree, Vivisimo and Lingo clustering algorithms) (Nirkhi and Hande, 2008). Similarly, Ivancsy and Kovacs (2006) listed some newer and relevant approaches to web clustering algorithms, such as Fuzzy clustering.

The corresponding web server logs information for each user is accessing a page, including IP address, time of access, file path, browser, and amount of transferred data. Vast volumes of web server log data are generated every day, and they can be used for commercial and non-commercial applications such as designing online shops or providing users with personalized content in digital libraries. Madhulatha (2015) supported this claim that clustering algorithms have been used for library book ordering, document classification, clustering web log data to discover groups of similar access patterns. Another study proposed using the KEA-Means algorithm, which combines the keyphrase extraction algorithm and the K-Means algorithm to generate the number of web documents clusters from a dataset (Ware and Dhawas, 2012). Lin etal. applied a novel algorithm named hierarchical clustering (HSClus) to extract a similarity matrix among pages via in-page and cross-page link structures. This resulted in clusters that hierarchically groups densely linked web pages into semantic clusters (Lin et al., 2009). TangRui et al. (2012) carried out an investigative study to discover the possibilities of applying other nature-inspired-based optimization algorithms: Fireflies, Cuckoos, Bats, and Wolves for performing clustering over Web Intelligence data. Sardar and Ansari (2018a) applied the K-Means algorithm using the MapReduce programming model to the task of document clustering (Sardar and Ansari, 2018a,b).

View article

Read full article



A taxonomy of web prediction algorithms

Josep Domenech, ... Ana Pont, in Expert Systems with Applications, 2012

3.4.2 Semantic approaches

Algorithms in this subcategory tend to capture the user surfing interest from semantic characteristics of visited pages.

A keyword-based semantic prefetching approach is proposed by Xu and Ibrahim (2004). This algorithm lies in the fact that client surfing is often guided by some keywords in the text that surrounds hyperlink definitions (hrefs) in web pages. Therefore, this approach predicts future requests based on semantic preferences of past retrieved objects. To do so, the algorithm ranks the links of the last visited page according to the similarity of the keywords found in their “anchor text” to those in the previously accessed links.

More recently, Georgakis and Li (2006) propose an algorithm that also considers the anchored text around each of the outbound hyperlinks. To assign a weight to each of these links, the algorithm keeps counters of the frequencies of appearance of bigrams for visited and non-visited links.

Davison (2002) proposes an algorithm that predicts the user’s next web page request by conducting an analysis of the text of his/her recently requested web pages. The algorithm compares the text in and around the hypertext anchors of the current page to the text contained in previously visited pages.

View article

Read full article



Towards accurate prediction of epileptic seizures: A review

Elie Bou Assi, ... Mohamad Sawan, in Biomedical Signal Processing and Control, 2017

2.3.4 Feature selection

Since transition from the interictal to the ictal state consists of complex mechanisms, prediction algorithms usually combine several features in an attempt to cover brain dynamics. This results in high dimensional feature spaces. It is thus crucial to select the most discriminative features that will best contribute to identification of the preictal state. Some may be redundant while others can be confounding and degrade classifier performance. Several feature selection methods have been used in seizure prediction studies, such as ReliefF [36], minimum normalized difference of percentiles [66], mDAD [66], forward selection [16], minimum redundancy maximum relevance (mRMR) [31,67], and genetic algorithm (GA) [31,48]. We will discuss the latter 2 methods in this review because of their extensive citation. Table 5 summarizes prominent feature selection and classification methods used in algorithmic seizure prediction studies.

Table 5. Prominent feature selection and classification methods used in algorithmic seizure prediction studies.

AuthorsYearFeature typeSelection methodInitial setRed. setClassifierClassifier typeSS (%)SP (%)/FPR (h−1)Stat. valid
Bandarabadi et al. [66]2012Bivariate linearmRMR4359.1Binary SVM (RBF kernel)Non-linear60.87%0.11h−1No
Bou Assi et al. [31]2015Univariate linearmRMR2242886.07%79.13%No
mRMR & GA590.28%88.53%
Direito et al. [48]2011Univariate linearmRMR200.4413242.56 * %73.44 * %No
GA47.89 * %81.59 * %
Moghim and Come [36]2014Univariate linear and non-linearReliefF20414Multi-class SVM (RBF kernel)91.14%99.55%Yes
Teixeira et al. [21]2014Univariate linear13273.73%0.19h−1Yes
MLP ANN74.17%0.29h−1
RBF ANN69.14%0.42h−1
Bou Assi et al. [31]2015Univariate linearGA22444.2ANFIS82.0%77.6%No
Rabbi et al. [68]2013Univariate and bivariate non-linear480 ϐ %0.46 ϐh−1No
Mirowski et al. [59]2009Bivariate non-linearLasso algorithm6300§VariableCNN71.0%0h−1Yes
Howbert et al. [16]2014Univariate linearForward selection9610Logistic regressionLinear78.66%0.08h−1Yes

SS: sensitivity; SP: specificity; FPR: False Prediction Rate; *: Average calculated over 3 subjects; ϐ: sensitivity at specificity reported for a preictal time of 45min; §: patterns of bivariate features containing 60 consecutive frames (5min) of 105 simultaneous features.

The mRMR algorithm ranks features by criteria of maximum relevance and minimum redundancy, defined in terms of cost function. While mutual information is one of the most common cost functions [23,31,67], several metrics have been proposed, all having the same principle and relying on criteria of similarity. In [48], the mRMR cost function was based on statistical F-testing as a measure of relevance and Pearson’s correlation as a measure of redundancy and used to reduce feature dimensions from 4410 to the first 132 ranked features. Bandarabadi et al. [66] employed the mRMR method based on a mutual information criterion to decrease feature dimensions from 435 to an average of 9.1 features. Recently, our group [31] also adopted the mRMR paradigm combined with a GA for optimal selection of electrode-feature combinations, allowing the selection of the first 28 ranked features out of 224 electrode-feature combinations. Although the number of features was reduced from 224 to 28, predictor performances were comparable using a support vector machine (SVM) with a radial basis function (RBF) kernel. Mean sensitivity was 86.07% and specificity was 79.13% after mRMR compared to mean sensitivity of 84.49% and specificity of 80.11% with the entire feature set.

GAs tend to replicate the principles of biological evolution. Starting from an initial, random population, the strongest recombine to survive and adapt to their external environment. Inspired by natural evolution, GAs generate solutions to optimization problems based on mutation, crossover, inheritance and selection. Several GA types in terms of selection method, genetic structure, and fitness function have been tested in seizure-prediction studies. In [48], genetic structure is a binary string that includes features as well as classifier hyper-parameters. An Elitist Non-dominated Sorting-based GA was included for the selection stage. Ataee et al. [69] proposed a GA-based method that optimizes selection of the best feature vector as well as its optimal window length. GA fitness function was based on Fisher Discriminant Ratio. These authors stated that window length and feature vector should be chosen simultaneously. However, it is not clear if out-of-sample testing was performed in this study. In [31], genetic structure was a binary string in which each feature was a binary number. The fitness function was classification loss according to a K-Nearest-Neighbor classifier. It is important to mention that since GA is an iterative procedure that aims to find an optimal combination of features, the size of the selected subset is not fixed and may vary. Our group [31] showed that, after GA feature selection, reducing their number from an average of 221.2 to 44.2 features, the selected subset of features (SS=87.09%; SP=85.71%) outperformed the whole set (SS=85.49%; SP=80.11%) in terms of sensitivity and specificity. Direito et al. [48], who conducted a comparative study of mRMR, GA, and Recursive feature elimination, concluded that dimensionality reduction improved predictor performance but that the best selection method was patient-specific. As in [48], the optimal selection method was subject-specific with mRMR and GA combination always achieving the best performance. It is noteworthy that ranking selection methods is limited by the requirement of fixing the size of the selected subset of features. Five different sizes of feature subsets were assessed in [66]: 3, 5, 10, 20 and 40. The authors stated that the best size was selected in such a way that performance was close to an optimal predictor (SS=100%; SP=0 Fp/h) in a patient-specific manner. The mean number of selected features was 8.75. Unfortunately, it was not reported if such optimization was performed on the training sample or on the whole set. Selection on the whole set may give overoptimistic results. Direito et al. [48] fixed the number selected at the first 132 highly-ranked features, but it was not clear why they chose this number. Moghim and Come [36] fixed the number of selected features at 14, stating that it would facilitate the benchmarking results of a previous study [70]. In [31], our group fixed the number of electrodes at 2 per feature, then tested a GA to select combination of the most discriminative features. The advantage of a GA is that it does not require fixing the size of the selected feature subset. Feature selection was only performed on the training set. An average number of 5 features was selected when combining mRMR with GA, and rose to 44.2 with the GA only. Interestingly, reducing the number of features to 5 (SS: 90.28%; SP: 88.53%) increased predictor performance with a SVM.

View article

Read full article



Top Articles
Latest Posts
Article information

Author: Geoffrey Lueilwitz

Last Updated: 03/29/2023

Views: 5513

Rating: 5 / 5 (60 voted)

Reviews: 91% of readers found this page helpful

Author information

Name: Geoffrey Lueilwitz

Birthday: 1997-03-23

Address: 74183 Thomas Course, Port Micheal, OK 55446-1529

Phone: +13408645881558

Job: Global Representative

Hobby: Sailing, Vehicle restoration, Rowing, Ghost hunting, Scrapbooking, Rugby, Board sports

Introduction: My name is Geoffrey Lueilwitz, I am a zealous, encouraging, sparkling, enchanting, graceful, faithful, nice person who loves writing and wants to share my knowledge and understanding with you.