Isolation forest

Isolation forest

Isolation forest is an unsupervised learning algorithm for anomaly detection that works on the principle of isolating anomalies, instead of the most common techniques of profiling normal points. In statistics, an anomaly (a.k.a. outlier) is an observation or event that deviates so much from other events to arouse suspicion it was generated by a different mean. For example, the graph in Fig.1 represents ingress traffic to a web server, expressed as the number of requests in 3-hours intervals, for a period of one month. It is quite evident by simply looking at the picture that some points (marked with a red circle) are unusually high, to the point of inducing suspect that the web server might have been under attack at that time. On the other hand, the flat segment indicated by the red arrow also seems unusual and might possibly be a sign that the server was down during that time period. Anomalies in a big dataset may follow very complicated patterns, which are difficult to detect "by eye" in the great majority of cases. This is the reason why the field of anomaly detection is well suited for the application of machine learning techniques. The most common techniques employed for anomaly detection are based on the construction of a profile of what is "normal": anomalies are reported as those instances in the dataset that do not conform to the normal profile. Isolation Forest uses a different approach: instead of trying to build a model of normal instances, it explicitly isolates anomalous points in the dataset. The main advantage of this approach is the possibility of exploiting sampling techniques to an extent that is not allowed to the profile-based methods, creating a very fast algorithm with a low memory demand. == History == The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008. The authors took advantage of two quantitative properties of anomalous data points in a sample, that is: they are the minority consisting of fewer instances and they have attribute-values that are very different from those of normal instances Since anomalies are typically few and very different from the other points in the sample, they must be easier to "isolate" compared to normal points. On the basis of this principle, Isolation Forest builds an ensemble of "Isolation Trees" (iTrees) for the data set and marks as anomalies the points that have short average path lengths on the iTrees. In a later paper, published in 2012 the same authors described a set of experiments to prove that iForest: has a low linear time complexity and a small memory requirement is able to deal with high dimensional data with irrelevant attributes can be trained with or without anomalies in the training set can provide detection results with different levels of granularity without re-training In 2013 Zhiguo Ding and Minrui Fei proposed a framework based on iForest to resolve the problem of detecting anomalies in streaming data. More application of iForest to streaming data are described in papers by Swee Chuan Tan et al., G. A. Susto et al. and Yu Weng et al. One of the main problems of the application of iForest to anomaly detection was not with the model itself, but rather in the way the "anomaly score" was computed. This problem was highlighted by Sahand Hariri, Matias Carrasco Kind and Robert J. Brunner in a 2018 paper, wherein they proposed an improved iForest model named Extended Isolation Forest (EIF). In the same paper the authors describe the improvements made to the original model and how they are able to enhance the consistency and reliability of the anomaly score produced for a given data point. == Algorithm == At the basis of the Isolation Forest algorithm there is the tendency of anomalous instances in a dataset to be easier to separate from the rest of the sample (isolate), compared to normal points. In order to isolate a data point the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value for the attribute, between the minimum and maximum values allowed for that attribute. An example of random partitioning in a 2D dataset of normally distributed points is given in Fig. 2 for a non-anomalous point and Fig. 3 for a point that's more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points. From a mathematical point of view, recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point xi in Fig. 2 is greater than the path length of xj in Fig. 3. More formally, let X = { x1, ..., xn } be a set of d-dimensional points and X' ⊂ X a subset of X. An Isolation Tree (iTree) is defined as a data structure with the following properties: for each node T in the Tree, T is either an external-node with no child, or an internal-node with one "test" and exactly two daughter nodes (Tl, Tr) a test at node T consists of an attribute q and a split value p such that the test q < p determines the traversal of a data point to either Tl or Tr. In order to build an iTree, the algorithm recursively divides X' by randomly selecting an attribute q and a split value p, until either (i) the node has only one instance or (ii) all data at the node have the same values. When the iTree is fully grown, each point in X is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the smaller path length in the tree, where the path length h(xi) of point x i ∈ X {\displaystyle x_{i}\in X} is defined as the number of edges xi traverses from the root node to get to an external node. A probabilistic explanation of iTree is provided in the iForest original paper. == Properties of Isolation Forest == Sub-sampling: since iForest does not need to isolate all of normal instances, it can frequently ignore the big majority of the training sample. As a consequence, iForest works very well when the sampling size is kept small, a property that is in contrast with the great majority of existing methods, where large sampling size is usually desirable. Swamping: when normal instances are too close to anomalies, the number of partitions required to separate anomalies increases, a phenomena known as swamping, which makes it more difficult for iForest to discriminate between anomalies and normal points. One of the main reasons for swamping is the presence of too many data for the purpose of anomaly detection, which implies one possible solution to the problem is sub-sampling. Since iForest respond very well to sub-sampling in terms of performance, the reduction of the number of points in the sample is also a good way to reduce the effect of swamping. Masking: when the number of anomalies is high it is possible that some of those aggregate in a dense and large cluster, making it more difficult to separate the single anomalies and, in turn, to detect such points as anomalous. Similarly to swamping, this phenomena (known as "masking") is also more likely when the number of points in the sample is big, and can be alleviated through sub-sampling. High Dimensional Data: one of the main limitation to standard, distance-based methods is their inefficiency in dealing with high dimensional datasets:. The main reason for that is, in a high dimensional space every point is equally sparse, so using a distance-based measure of separation is pretty ineffective. Unfortunately, high-dimensional data also affects the detection performance of iForest, but the performance can be vastly improved by adding a features selection test like Kurtosis to reduce the dimensionality of the sample space. Normal Instances Only: iForest performs well even if the training set does not contain any anomalous point, the reason being that iForest describes data distributions in such a way that high values of the path length h(xi) correspond to the presence of data points. As a consequence, the presence of anomalies is pretty irrelevant to iForest's detection performance. == Anomaly Detection with Isolation Forest == Anomaly detection with Isolation Forest is a process composed of two main stages: in the first stage, a training dataset is used to build iTrees as described in previous sections. in the second stage, each instance in test set is passed through the iTrees build in the previous stage, and a proper "anomaly score" is assigned to the instance using the algorithm described below Once all the instances in the test set have been assigned an anomaly score, it is possible to mark as "anomaly" any point whose score is greater than a predefined threshold, which depends on the domain the analysis is being applied to. === Anomaly Score === Th

Lemmy (social network)

Lemmy is free and open-source, social news aggregation software for running self-hosted discussion forums. These hosts, known as "instances", communicate with each other using the ActivityPub protocol. == History == Lemmy was created by the user Dessalines on GitHub in February 2019 and licensed under the Affero General Public License. In a 2020 post, Lemmy's co-creator Dessalines wrote about the origin of the name Lemmy. "It was nameless for a long time, but I wanted to keep with the fediverse tradition of naming projects after animals. I was playing that old-school game Lemmings, and Lemmy (from Motorhead) had passed away that week, and we held a few polls for names, and I went with that." According to the Fediverse statistics sites the-federation.info and fedidb.com, Lemmy had fewer than 100 instances prior to June 2023, but grew to 455 instances with approximately 48,600 monthly active users as of 22 December 2025, with the largest instances being lemmy.world and lemmy.ml, reporting about 14,144 and 1,982 monthly active users, respectively. == Description == Lemmy is made up of a network of individual installations of the Lemmy software that can intercommunicate. This departs from the centralized, monolithic structure of other social media platforms. It has been described as a federated alternative to Reddit. Users on individual instances submit posts with links, text, or pictures to user-created forums for discussion called "communities". Discussion is in the form of threaded comments. Posts and comments can be upvoted or downvoted though the ability to downvote can be disabled by the admins of each instance. Communities are local to each instance, however users may subscribe to communities, create posts and leave comments across instances. Moderation is conducted by the administrators of each instance and moderators of specific communities. Community names begin with c/ in the URL (e.g lemmy.ml/c/simpleliving) and are mentionable using the !community@instance format. On each instance, a front page presents the user with popular posts from several communities. These posts can then be filtered according to origin: posts from the instance the user is on, or from all federated instances. It can also be made to only show posts from communities the user has subscribed to. Lemmy instances are generally supported by donations. == Relations with other social networks == ActivityPub is the protocol used to allow Lemmy instances to operate as a federated social network. It allows users to interact with compatible platforms such as Kbin and Mastodon. In June 2023, following the announcement of Reddit API service changes intended to reduce the use of third-party Reddit clients, community members discussed relocating to Lemmy and other Reddit competitors. Reddit banned a user for promoting switching to Lemmy along with the r/LemmyMigration subreddit as a whole, leading to a Streisand effect after it garnered attention on sites like Hacker News. The ban was reversed a day later. == Third-party software == Prominent third-party Reddit clients Sync and Boost which had shut down due to changes to the pricing of Reddit's API began working on Lemmy clients, with them later relaunching as Sync for Lemmy and Boost for Lemmy. Multiple other apps and browser clients have also been developed.

Dhammin

Dhammin (Arabic: ضمّن) is a political platform that manages candidates' electoral campaigns for the National Assembly, Municipal Council or Cooperative Society councils of Kuwait. The platform was founded by Abdullah Al-Salloum and it is, according to news reports and interviews, the first within the field to apply distributed-systems' methodologies.

Flok (company)

Flok (formerly Loyalblocks) was an American tech startup based in New York City that provides marketing services such as chatbots/AI, customer loyalty programs, mobile apps and CRM services to local businesses. In January 2017, the company was acquired by Wix.com. Around March 2017, Flok ceased regular communication. At some point in 2019 Flok communicated to its customers that it would shut down in March 2020. == Background == Flok was founded in 2011 by Ido Gaver and Eran Kirshenboim and has offices in Tel Aviv, Israel. In May 2013, Flok secured a $9 million Series A Round from General Catalyst Partners with participation from Founder Collective and existing investor Gemini Israel Ventures. In total, Flok has raised over $18 million in venture capital in three rounds. In May 2014, Flok announced a self-service loyalty platform for SMBs to build their own programs with beacon integration. At that time, approximately 40,000 businesses were using the service. In 2016, Flok released a turnkey chatbot service for local businesses, and was featured in AdWeek for developing the first weed bot chatbot for a California cannabis business. == Services == Flok offered an eponymous customer-facing app, that consumers use to receive rewards and deals from partner businesses, and a Flok business app for merchants to manage the platform.

Common data model

A common data model (CDM) can refer to any standardised data model which allows for data and information exchange between different applications and data sources. Common data models aim to standardise logical infrastructure so that related applications can "operate on and share the same data", and can be seen as a way to "organize data from many sources that are in different formats into a standard structure". A common data model has been described as one of the components of a "strong information system". A standardised common data model has also been described as a typical component of a well designed agile application besides a common communication protocol. Providing a single common data model within an organisation is one of the typical tasks of a data warehouse. == Examples of common data models == === Border crossings === X-trans.eu was a cross-border pilot project between the Free State of Bavaria (Germany) and Upper Austria with the aim of developing a faster procedure for the application and approval of cross-border large-capacity transports. The portal was based on a common data model that contained all the information required for approval. === Climate data === The Climate Data Store Common Data Model is a common data model set up by the Copernicus Climate Change Service for harmonising essential climate variables from different sources and data providers. === General information technology === Within service-oriented architecture, S-RAMP is a specification released by HP, IBM, Software AG, TIBCO, and Red Hat which defines a common data model for SOA repositories as well as an interaction protocol to facilitate the use of common tooling and sharing of data. Content Management Interoperability Services (CMIS) is an open standard for inter-operation of different content management systems over the internet, and provides a common data model for typed files and folders used with version control. The NetCDF software libraries for array-oriented scientific data implements a common data model called the NetCDF Java common data model, which consists of three layers built on top of each other to add successively richer semantics. === Health === Within genomic and medical data, the Observational Medical Outcomes Partnership (OMOP) research program established under the U.S. National Institutes of Health has created a common data model for claims and electronic health records which can accommodate data from different sources around the world. PCORnet, which was developed by the Patient-Centered Outcomes Research Institute, is another common data model for health data including electronic health records and patient claims. The Sentinel Common Data Model was initially started as Mini-Sentinel in 2008. It is used by the Sentinel Initiative of the USA's Food and Drug Administration. The Generalized Data Model was first published in 2019. It was designed to be a stand-alone data model as well as to allow for further transformation into other data models (e.g., OMOP, PCORNet, Sentinel). It has a hierarchical structure to flexibly capture relationships among data elements. The JANUS clinical trial data repository also provides a common data model which is based on the SDTM standard to represent clinical data submitted to regulatory agencies, such as tabulation datasets, patient profiles, listings, etc. === Logistics === SX000i is a specification developed jointly by the Aerospace and Defence Industries Association of Europe (ASD) and the American Aerospace Industries Association (AIA) to provide information, guidance and instructions to ensure compatibility and the commonality. The associated SX002D specification contains a common data model. === Microsoft Common Data Model === The Microsoft Common Data Model is a collection of many standardised extensible data schemas with entities, attributes, semantic metadata, and relationships, which represent commonly used concepts and activities in various businesses areas. It is maintained by Microsoft and its partners, and is published on GitHub. Microsoft's Common Data Model is used amongst others in Microsoft Dataverse and with various Microsoft Power Platform and Microsoft Dynamics 365 services. === Rail transport === RailTopoModel is a common data model for the railway sector. === Other === There are many more examples of various common data models for different uses published by different sources.

Viaweb

Viaweb was a web-based application that allowed users to build and host their own online stores with little technical expertise using a web browser. The company was started in July 1995 by Paul Graham, Robert Morris (using the pseudonym "John McArtyem"), and Trevor Blackwell. Graham claims Viaweb was the first application service provider. Viaweb was also unusual for being partially written in the Lisp programming language. The software was originally called Webgen, but another company was using the same name, so the company renamed it to Viaweb, "because it worked via the Web". In 1998, Yahoo! Inc. bought Viaweb for 455,000 shares of Yahoo! capital stock, valued at about $49 million, and renamed it Yahoo! Store. Viaweb's example has been influential in Silicon Valley's entrepreneurial culture, largely due to Graham's widely read essays and his subsequent career as a successful venture capitalist.

IT operations analytics

In the fields of information technology (IT) and systems management, IT operations analytics (ITOA) is an approach or method to retrieve, analyze, and report data for IT operations. ITOA may apply big data analytics to large datasets to produce business insights. In 2014, Gartner predicted its use might increase revenue or reduce costs. By 2017, it predicted that 15% of enterprises will use IT operations analytics technologies. == Definition == IT operations analytics (ITOA) (also known as advanced operational analytics, or IT data analytics) technologies are primarily used to discover complex patterns in high volumes of often "noisy" IT system availability and performance data. Forrester Research defined IT analytics as "The use of mathematical algorithms and other innovations to extract meaningful information from the sea of raw data collected by management and monitoring technologies." Note, ITOA is different than AIOps, which focuses on applying artificial intelligence and machine learning to the applications of ITOA. == History == Operations research as a discipline emerged from the Second World War to improve military efficiency and decision-making on the battlefield. However, only with the emergence of machine learning tech in the early 2000s could an artificially intelligent operational analytics platform actually begin to engage in the high-level pattern recognition that could adequately serve business needs. A critical catalyst towards ITOA development was the rise of Google, which pioneered a predictive analytics model that represented the first attempt to read into patterns of human behavior on the Internet. IT specialists then applied predictive analytics to the IT Industry, coming forward with platforms that can sift through data to generate insights without the need for human intervention. Due to the mainstream embrace of cloud computing and the increasing desire for businesses to adopt more big data practices, the ITOA industry has grown significantly since 2010. A 2016 ExtraHop survey of large and mid-size corporations indicates that 65 percent of the businesses surveyed will seek to integrate their data silos either this year or the next. The current goals of ITOA platforms are to improve the accuracy of their APM services, facilitate better integration with the data, and to enhance their predictive analytics capabilities. == Applications == ITOA systems tend to be used by IT operations teams, and Gartner describes seven applications of ITOA systems: Root cause analysis: The models, structures and pattern descriptions of IT infrastructure or application stack being monitored can help users pinpoint fine-grained and previously unknown root causes of overall system behavior pathologies. Proactive control of service performance and availability: Predicts future system states and the impact of those states on performance. Problem assignment: Determines how problems may be resolved or, at least, direct the results of inferences to the most appropriate individuals, or communities in the enterprise for problem resolution. Service impact analysis: When multiple root causes are known, the analytics system's output is used to determine and rank the relative impact, so that resources can be devoted to correcting the fault in the most timely and cost-effective way possible. Complement best-of-breed technology: The models, structures and pattern descriptions of IT infrastructure or application stack being monitored are used to correct or extend the outputs of other discovery-oriented tools to improve the fidelity of information used in operational tasks (e.g., service dependency maps, application runtime architecture topologies, network topologies). Real time application behavior learning: Learns & correlates the behavior of Application based on user pattern and underlying Infrastructure on various application patterns, create metrics of such correlated patterns and store it for further analysis. Dynamically baselines threshold: Learns behavior of Infrastructure on various application user patterns and determines the Optimal behavior of the Infra and technological components, bench marks and baselines the low and high water mark for the specific environments and dynamically changes the bench mark baselines with the changing infra and user patterns without any manual intervention. == Types == In their Data Growth Demands a Single, Architected IT Operations Analytics Platform, Gartner Research describes five types of analytics technologies: Log analysis Unstructured text indexing, search and inference (UTISI) Topological analysis (TA) Multidimensional database search and analysis (MDSA) Complex operations event processing (COEP) Statistical pattern discovery and recognition (SPDR) == Tools and ITOA platforms == A number of vendors operate in the ITOA space: