Description
Data Warehousing
Data warehousing is the practice of combining data from multiple, usually varied, sources into one comprehensive and easily manipulated database. Common ways of accessing a data warehouse include queries, analysis, and reporting. Because data warehousing produces a single database in the end, the number of sources can be as large as you like, provided that the system can handle the volume. The final result is homogeneous data, which is easier to manipulate.
Data warehousing is commonly used by companies to analyze trends over time. Companies may well use a data warehouse to view day-to-day operations, but its primary function is to facilitate strategic planning based on long-term data overviews. From such overviews, business models, forecasts, and other reports and projections can be made.
Because the data stored in a data warehouse is intended for overview-style reporting, it is routinely read-only for the people who query it. If the stored data changes, you need to run a new query to see the change. This is not to say that data warehousing involves data that is never updated. On the contrary, the data stored in data warehouses is updated all the time. It is the reporting and the analysis that take a long-term view.
Data warehousing is not the be-all and end-all for storing all of a company's data. Rather, a data warehouse houses the data needed for specific analysis. More comprehensive data storage requires different facilities that are more static and less easily manipulated than those used for data warehousing.
Data warehousing is typically used by larger companies analyzing larger sets of data for enterprise purposes. Smaller companies wishing to analyze just one subject, for example, usually use data marts, which are much more specific and targeted in their storage and reporting. A data warehouse often includes smaller amounts of data grouped into data marts.
In this way, a larger company might have at its disposal both data warehousing and data marts, allowing users to choose the source and functionality depending on current needs.
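The idea of turning varied sources into homogeneous data can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two source systems, their field names, and the target layout are hypothetical, and a real warehouse load would use a database and dedicated tooling rather than Python lists.

```python
# Hypothetical records from two source systems with different layouts.
billing_rows = [
    {"cust_no": "C-001", "amount_usd": 120.0, "date": "2024-01-15"},
    {"cust_no": "C-002", "amount_usd": 80.5, "date": "2024-02-03"},
]
crm_rows = [
    {"customer": "c001", "spend": "45.00", "when": "2024-01-20"},
]

def from_billing(row):
    """Map a billing record onto the (assumed) common warehouse layout."""
    return {
        "customer_id": row["cust_no"].replace("-", "").upper(),
        "amount": float(row["amount_usd"]),
        "month": row["date"][:7],
    }

def from_crm(row):
    """Map a CRM record onto the same layout."""
    return {
        "customer_id": row["customer"].upper(),
        "amount": float(row["spend"]),
        "month": row["when"][:7],
    }

def load_warehouse(billing, crm):
    """Combine both sources into one homogeneous list of rows."""
    return [from_billing(r) for r in billing] + [from_crm(r) for r in crm]

def monthly_totals(warehouse):
    """A simple trend report: total amount per month across all sources."""
    totals = {}
    for row in warehouse:
        totals[row["month"]] = totals.get(row["month"], 0.0) + row["amount"]
    return totals

warehouse = load_warehouse(billing_rows, crm_rows)
print(monthly_totals(warehouse))
```

Once both sources share one layout, the report no longer needs to know where each row came from, which is the practical meaning of homogeneous data here.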
Types of Data Warehousing
Although you must ensure that your data warehouse fits your own unique needs, some guidelines can help you determine the probable complexity of its environment and structure. Three types, or classifications, of data warehousing are described below.
1) Data Warehouse Lite
A data warehouse lite is a no-frills, bare-bones, low-tech approach to providing data that can help with some of your business decision-making. No-frills means that you put together, wherever possible, proven capabilities and tools already within your organization to build your system.
Prof. (Ms.) Avani – Magistrate 1 MIS
Subject Areas and Data Content
A data warehouse lite is focused on the reporting or analysis of only one or possibly two subject areas. Suppose that in your job at a wireless division of a telephone company, you analyze the sales of services such as in-network minutes, out-of-network minutes, text messaging, Internet access, and other mobile usage to consumer households. If you build a data warehouse lite exclusively for this purpose, you have all the necessary information to support your analysis and reporting for the consumer market. You do not have any information about business users and payment history, however, because that information is part of a different subject area.
Based on the subject area limitation, a data warehouse lite has just enough data content to satisfy the primary purpose of the environment, but not enough for many unstructured what-if scenarios its users might create. You must therefore choose carefully from among the set of all possible data elements and select a manageable subset: elements that, without a doubt, are important to have. This process is the same for any data warehouse implementation, except that you must be extremely disciplined when you are making decisions about what content to include.
Data Sources
A data warehouse lite has a limited set of data sources, typically one to a handful. As part of an overall single-application environment, for example, the data warehouse lite acts as the restructuring agent for the application's data to make it more query- and report-friendly.
The architecture of a data warehouse lite, as shown in the figure, contains these major component types:
• A single database contains the warehouse's data.
• That database is fed directly from each of the sources providing data to the warehouse.
• Users access data directly from the warehouse.
Low-tech approach to moving data into a data warehouse lite: database backup tapes or files.
The architecture of a data warehouse lite is built around straight-line movement of data.
2) Data Warehouse Deluxe: A standard data warehouse implementation that uses advanced technologies to solve complex business information and analytical needs across a broader user population. You'll most likely focus most of your data warehousing-related activities on the data warehouse deluxe environment, as shown in the figure. Data from many different sources converge in these "real" data warehouses, which make available a wealth of architectural options that you can tailor to meet your specific needs.
A data warehouse deluxe has a broader subject-area focus than a data warehouse lite.
Subject Areas and Data Content
A data warehouse deluxe contains a broad range of related subject areas: everything (or most things) that would follow a natural way of thinking about and analyzing information. In a data-warehouse-deluxe version of the telephone-company example (see the "Subject Areas and Data Content" section of the data warehouse lite), you will likely find not only the subject area of consumer wireless services (among other items), but also these elements:
• Consumer basic calling revenues and volumes
• Consumer long-distance calling revenues and volumes
• Consumer wireless calling revenues and volumes
• Business wireless services
• Business basic calling revenues and volumes
• Business long-distance calling revenues and volumes
• Business wireless calling revenues and volumes
• Internet access (DSL) services
• Internet revenues and volumes
The subject range of a data warehouse deluxe is broader than that of a data warehouse lite because:
• The user base is broader (more organizations have their people use the data warehouse).
• The scope of any given user's queries and reports is broader than just one or two subject areas. For example, a user might run reports comparing trends in add-on services for businesses and consumers to see where to concentrate future sales-and-marketing efforts.
When you implement a data warehouse deluxe, you usually need access capabilities (unlike with a data warehouse lite) in addition to simple results reporting. Therefore, although you might be able to use standard reports as a starting point when you're deciding what should be in your warehouse, that's rarely enough. Follow these steps to thoroughly understand your source systems:
1. Take a complete inventory of available information. This inventory is called a source systems analysis.
2. Review each candidate source element and answer these questions:
• What data do you need to include in the data warehouse and what should you leave out?
• What information should be summarized and what should be left at the detailed level?
• What data should remain in the data warehouse forever, and what data should you purge from the data warehouse after it has aged?
• What else do you need to know about the data before you put it in your data warehouse?
This step is one of the most severe tests of how well the IT people and business users get along throughout the data warehousing project.
Data Sources
You won't be lucky enough to find any single-source environments when you're building a data warehouse deluxe.
Now you have a whole new set of (I have to use the word) problems that you must deal with, including the ones in this list:
• Different encodings for similar information: For example, different sets of customer numbers come from different sources.
• Data integrity problems across multiple sources: The information in one source is different from the information in another when they should be the same.
• Different source platforms: As an example, an IBM mainframe that has DB2/MVS databases might contain the data in one of the sources, another IBM mainframe that has VSAM files might have another set of source data, a set of servers might contain data within Oracle databases, and the rest of the source data might all be stored in SQL Server databases on Windows servers.
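The first two problems in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed method: the source names, the customer numbers, the cross-reference table `KEY_MAP`, and the helper functions are all hypothetical.

```python
# Hypothetical cross-reference table:
# (source system, source customer number) -> single warehouse key.
KEY_MAP = {
    ("billing", "000123"): 1,
    ("crm", "CUST-123"): 1,
    ("billing", "000456"): 2,
    ("crm", "CUST-456"): 2,
}

def to_warehouse_key(source, source_key):
    """Translate a source-specific customer number; None if unmapped."""
    return KEY_MAP.get((source, source_key))

def find_conflicts(records, field):
    """Return warehouse keys whose sources disagree on the given field."""
    seen = {}
    for rec in records:
        key = to_warehouse_key(rec["source"], rec["key"])
        if key is None:
            continue  # unmapped records would need separate handling
        seen.setdefault(key, set()).add(rec[field])
    return sorted(k for k, values in seen.items() if len(values) > 1)

records = [
    {"source": "billing", "key": "000123", "city": "Nagpur"},
    {"source": "crm", "key": "CUST-123", "city": "Nagpur"},
    {"source": "billing", "key": "000456", "city": "Pune"},
    {"source": "crm", "key": "CUST-456", "city": "Mumbai"},
]
conflicts = find_conflicts(records, "city")
print(conflicts)
```

Here customer 2 is flagged because the two sources report different cities. Deciding which source wins is a business decision that the sketch deliberately leaves open.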
Although the exact number of data sources depends on the specifics of your implementation, data warehouse deluxes tend to have an average of eight to ten applications and external databases that provide data to the warehouse.
Business Intelligence Tools
The broad range of subject areas and the wealth of data in a data warehouse deluxe mean that you usually have several different ways of looking at that warehouse's contents. This list shows the different ways that you can use a data warehouse:
• Simple reporting and querying: As with a data warehouse lite, the purpose of the warehouse deluxe is to "Tell me what happened."
• Business analysis: You use the warehouse to "Tell me what happened, and why."
• Dashboards and scorecards: In this model, a variety of information is gathered from the data warehouse and made available to users who don't want to mess around with the data warehouse; they want to see snapshots of many different things. Its purpose is to "Tell me a lot of things, but don't make me work too hard to get the answers I want."
• Data mining or statistical analysis: In this area, statistical, artificial intelligence, and related techniques are used to mine through large volumes of data and provide knowledge without users even having to ask specific questions. Its purpose is to "Tell me something interesting, even though I don't know what questions to ask, and also tell me what might happen."
You're likely to employ at least three, and perhaps all four, of these types of data warehouse user-access techniques when you use a data warehouse deluxe. Although tool vendors increasingly try to provide suites of products to handle as many of these different functions as possible, you do have to deal with different products, and so does your user community.
3) Data Warehouse Supreme: A data warehouse that has large-scale data distribution and advanced technologies that can integrate various "runs the business"
systems, improving the overall quality of the data assets across business information analytical needs and transactional needs. Although today's state-of-the-art data warehouse typically looks like a complicated data warehouse deluxe, the following sections describe what the data warehouse
of tomorrow will look like. Few enterprises have ventured in this direction; because of the overall cost and the capabilities required, data warehouse supremes are still rare.
Subject Areas and Data Content
The number of subject areas in a data warehouse supreme is unlimited because the data warehouse is virtual: it isn't all contained in a single database or even within multiple databases that you personally load and maintain. Instead, only part of your warehouse (probably a small part) is physically located on some data warehouse server; the rest is out there in cyberspace somewhere, accessible through networking capabilities as though it were all part of some physically centralized data warehouse. With a data warehouse supreme, your warehouse users have a practically unlimited number of subject-area possibilities: anything that could possibly be of interest to them.
Think of how you use the Internet today to access Web sites all over the world, sites that someone else creates and maintains. Now imagine that each of those sites contains information about some specific area of interest to you, rather than advertising, job ads, electronic storefronts, and whatever else you spend your time surfing the Internet trying to find. Also imagine that you can query and run reports by using the contents of one or more of these sites as your input. That's the model of the data warehouse supreme: opening up an unlimited number of possibilities to users.
Leading-edge corporations are beginning to pursue and deliver seamless convergence of different types of data: narrative documents, video, images, and ordinary data (such as numbers and character information). A data warehouse supreme has all this: all the different types of data that you need to support better business decision-making. In terms of total capacity, a data warehouse supreme is huge; it surpasses today's limits.
The distribution of the information across many different platforms, much faster and higher-performance networking infrastructure, and increasingly "smarter" database management systems (in addition to, of course, steadily increasing disk storage capacities) create this capacity expansion.
Data Sources
Because of the wide breadth of subject areas in a data warehouse supreme, it has numerous data sources. The good news: Because many of the sources are external to your own warehousing environment, you aren't personally responsible for all the extraction, transformation, and loading to get them into your warehouse. The bad news: Someone has to perform those tasks, and you have little or no control over elements such as quality assurance processes or how frequently the data is refreshed. I have more good news, though: Because the most critical part of a data warehouse supreme is still internally acquired data (the data coming from your internal applications), the things you do today to make that data warehouse-ready will still be done in the future.
Because you populate your data warehouse supreme with multimedia information in addition to traditional data (such as numeric, alphabetic, and date values), the types of data sources broaden from traditional applications to video servers, Web sites, and databases that store documents and text.
Business Intelligence Tools
As far as I can tell, the Big Four types of business intelligence discussed earlier in the "Business Intelligence Tools" section for the data warehouse deluxe (basic reporting and querying, business analysis, dashboards and scorecards, and data mining) are all part of the data warehouse supreme environment. Of the four, the most significant advances and improvements during the next few years will probably occur in data mining as vendors push enhancements into their products. However, these user-access methods will be relegated to providing information that will be visualized in other forms. The business intelligence tools will enable users to pull information from the data warehouse supreme and integrate it with a better visualization, for instance Google Earth or Microsoft Virtual Earth. Such combinations, known as mash-ups, are becoming more prevalent and enable users to see the data from the data warehouse supreme in more realistic forms: not columns on a report, but dots or shadings on a map.
The biggest difference between today's state-of-the-art data warehouses and the data warehouse supreme, however, is the dramatically increased use of push technology. By using intelligent agents ("assistants" you program to perform certain functions for you), you can have information fed back to you from the far ends of the Internet-based universe, not to mention your own large data warehouse servers within your own company. The figure illustrates some of the ways in which intelligent agents can help you make very efficient use of data warehousing.
Intelligent agents are an important part of the push technology architecture of a data warehouse supreme.
Database
A data warehouse supreme most likely consists of a database environment that meets these requirements:
• It's distributed across many different platforms.
• It operates in a location-transparent manner: Users make queries that access data from the appropriate platform without the users having to know the physical location (in much the same way that you access Internet Web sites by name, rather than by network address).
• It has object-oriented capabilities to store images, videos, and text in addition to the traditional data, such as numeric and date information.
• Because of dramatically faster performance than current data warehouses, it increasingly permits you to access data directly from transactional databases without having to copy the information to a separate data warehouse database.
Data Extraction, Movement and Loading
Here's how the extraction, movement, and loading of data occur in a data warehouse supreme:
• Data that's moved (copied) from a source application's database or file system into a separate database in the data warehouse is handled almost identically to how you perform those tasks in a data warehouse deluxe.
• The increasing use of Operational Data Stores, or ODSs (real-time availability of analytical data so that you don't have to deal with delayed access), means that more messaging occurs between your data sources and your warehouse database.
• The data source determines when data should be moved into the warehouse environment, so the warehouse doesn't have the responsibility to request updates and additions. When new data is inserted into the source database (or existing data is modified or deleted), the appropriate instructions and accompanying data are sent to the warehouse.
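The push model described above, in which the source sends instructions and accompanying data whenever a row is inserted, modified, or deleted, might be sketched as follows. The message layout (`op`, `key`, `data`) and the in-memory dictionary standing in for the warehouse are assumptions made purely for illustration; a real system would use a messaging product and a database.

```python
def apply_change(warehouse, message):
    """Apply one change message pushed by a source system."""
    op = message["op"]
    key = message["key"]
    if op in ("insert", "update"):
        # Store a copy so later changes at the source do not leak in.
        warehouse[key] = dict(message["data"])
    elif op == "delete":
        warehouse.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")

# The warehouse never polls; it only reacts to what the source sends.
warehouse = {}
messages = [
    {"op": "insert", "key": 1, "data": {"plan": "basic"}},
    {"op": "insert", "key": 2, "data": {"plan": "premium"}},
    {"op": "update", "key": 1, "data": {"plan": "premium"}},
    {"op": "delete", "key": 2},
]
for message in messages:
    apply_change(warehouse, message)

print(warehouse)
```

The design point is the direction of control: the source decides when a change is sent, so the warehouse side reduces to a function that applies whatever arrives.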
Architecture
The figure shows an example of what the architecture of a data warehouse supreme might look like. With all the upcoming technology trends and improvements discussed in the preceding sections, however, your data warehouse supreme can look like (almost) anything you want.
Sample architecture from a data warehouse supreme (although it can look like just about anything).
MultiDimensional Database (MDDB)
A multidimensional database (MDDB) is a type of database that is optimized for data warehouse applications. Multidimensional databases are frequently created using input from existing relational databases. Whereas a relational database is typically accessed using a Structured Query Language (SQL) query, a multidimensional database allows a user to ask questions like "How many Apples have been sold in Nagpur so far this year?" and similar questions related to summarizing business operations and trends. A multidimensional database - or a multidimensional database management system (MDDBMS) - implies the ability to rapidly process the data in the database so that answers can be generated quickly. A number of vendors provide products that use multidimensional databases. Approaches to how data is stored and the user interface vary.
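To make the "Apples sold in Nagpur this year" style of question concrete, here is a minimal sketch that treats the database as cells addressed by product, city, and year. The fact rows, the numbers, and the function names are invented for illustration, and a real MDDB product would precompute and index such aggregates rather than scan a dictionary.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, city, year, units_sold).
facts = [
    ("Apples", "Nagpur", 2024, 120),
    ("Apples", "Nagpur", 2024, 80),
    ("Apples", "Pune", 2024, 50),
    ("Oranges", "Nagpur", 2024, 200),
    ("Apples", "Nagpur", 2023, 90),
]

def build_cube(rows):
    """Aggregate fact rows into cells keyed by (product, city, year)."""
    cube = defaultdict(int)
    for product, city, year, units in rows:
        cube[(product, city, year)] += units
    return cube

def total(cube, product=None, city=None, year=None):
    """Sum cells; None means 'all values' along that dimension."""
    return sum(
        units
        for (p, c, y), units in cube.items()
        if product in (None, p) and city in (None, c) and year in (None, y)
    )

cube = build_cube(facts)
print(total(cube, product="Apples", city="Nagpur", year=2024))
print(total(cube, product="Apples"))
```

Leaving a dimension unspecified rolls the answer up across it, which is the kind of summarizing question the text describes.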
Conceptually, a multidimensional database uses the idea of a data cube to represent the dimensions of data available to a user. For example, "sales" could be viewed in the dimensions of product model, geography, time, or some additional dimension. In this case, "sales" is known as the measure attribute of the data cube and the other dimensions are seen as feature attributes. Additionally, a database creator can define hierarchies and levels within a dimension (for example, state and city levels within a regional hierarchy).
Comparison of Relational and Multi-Dimensional Database Structures
Relational Databases
The relational database model uses a two-dimensional structure of rows and columns to store data, in tables of records corresponding to real-world entities. Tables can be linked by common key values. E.F. Codd first designed this model in 1970, while working for IBM, and its simplicity revolutionised database usage at the time. Codd's work was in many ways ahead of its time, as computing power could not support the overheads of his database system (Hasan 1999). In the 1980s the power of computers had grown to the point where these overheads were no longer a problem, and today relational database management systems (DBMS) are available on local desktops, as well as on large organisational database management servers. The techniques of entity-relationship (ER) modelling and the structuring of data in normalised tables have become popular with trained database administrators and designers, who routinely use relational DBMS to store huge volumes of organisational data with very high transaction rates. Although relational databases are deceptively simple to design and operate, that simplicity breaks down for the end user when it comes to running queries.
Accessing data from relational databases may require complex joins of many tables and is distinctly non-trivial for untrained end users, who may need to hire IT professionals to structure such queries in a query language such as SQL. When statements that change data or structure are run, such as INSERT, DELETE and ALTER TABLE, the consequences of a mistake are far more serious on a live system.
Multi-Dimensional Databases
In a multi-dimensional database system, the data is presented to the user in such a way as to represent a hypercube, or multi-dimensional array, where each individual data value is contained within a cell accessible by multiple indexes. A simple example is given in Figure 1, where a fictional student exam result database is presented. This database contains three dimensions, namely Result, Student Name and Exam. In this example, an individual student (represented by Student Name) may have their exam results for several exams compared over a period of time, for example a four-year undergraduate course. The ability to present data in such a top-level view is characteristic of multi-dimensional systems, and shows just how powerful these systems can be. Of course a multi-dimensional system is not limited to three dimensions as in the previous example, but beyond three it becomes more difficult to present such structures pictorially. Staying with the example presented in Figure 1, let us now add a fourth dimension called Subject. Let us assume our students study computer science, with subjects in Databases, Programming and Software Engineering. If we imagine this new dimension as
being a box containing our previous three dimensions, then we would have three such boxes, namely one for each subject that our students were tested on, as shown in Figure 2.
Figure 1 A typical 3D hypercube
Of course this model can be extended to a fifth dimension, and a sixth and so on, until all requirements are met. Nevertheless, what are the advantages of such a system over a traditional relational system?
Figure 2 A four-dimensional database structure
Advantages of Multi-Dimensional Databases over Relational Databases
If we look again at our example of a student exam result dataset, there are many reasons why it is more efficient to represent our dataset with a multi-dimensional array rather than a
relational table. For example, all similar information is lined up in a single dimension, like Results, so that the values can be very quickly summed to a total or compared to get an immediate idea of how student results are faring this semester. The multi-dimensional array structure represents a higher level of organisation than the relational table. The structure itself represents a more intelligent view of the data it contains, because our perspectives of this data are embedded directly into the structure as dimensions, as opposed to being placed into fields. For example, if we were to design a fictional relational table for our student results, it might look something like the following:
Student Name      Exam                 Result
John Collins      Databases            70
John Collins      Programming          72
John Collins      Operating Systems    60
Larry Wall        Databases            80
Larry Wall        Programming          99
Larry Wall        Operating Systems    70
Linus Torvalds    Databases            80
Linus Torvalds    Programming          90
Linus Torvalds    Operating Systems    99
The structure of this relational table can tell us nothing of the nature of the contents of these fields, only that there are three fields (Student Name, Exam and Result) and nine records. If we were to present a three-dimensional view of this data, adding a third dimension called Semester, it might look something like Figure 3. As you can see from Figure 3, there is no need to have Result as a dimension, because the exam results are contained within the cells of the database structure. Another obvious advantage is the removal of the duplication in the relational table, where each student name was repeated three times, once for each exam that they participated in. In the multi-dimensional view, the Student Name and the Exam become dimensions, or in effect indexes into that data, so having duplicates does not make any sense. Notice how all related information neatly lines up in the multi-dimensional view: for example, all programming results for John Collins over all three semesters line up along the z-axis (i.e. from the diagram view perspective, the Semester dimension), while all exam results for John Collins in all subjects line up on the x-axis (the Exam dimension). Programming results for all students line up on the y-axis (the Student Name dimension).
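The move from repeated rows to an indexed array can be sketched with the nine records of the relational table. This is a simplified illustration: it builds only the two dimensions for which the table supplies data (Student Name and Exam), since no per-semester values are given; a Semester dimension would add a third index in the same way. The function and variable names are illustrative.

```python
students = ["John Collins", "Larry Wall", "Linus Torvalds"]
exams = ["Databases", "Programming", "Operating Systems"]

# The nine records of the relational table, one tuple per row.
rows = [
    ("John Collins", "Databases", 70),
    ("John Collins", "Programming", 72),
    ("John Collins", "Operating Systems", 60),
    ("Larry Wall", "Databases", 80),
    ("Larry Wall", "Programming", 99),
    ("Larry Wall", "Operating Systems", 70),
    ("Linus Torvalds", "Databases", 80),
    ("Linus Torvalds", "Programming", 90),
    ("Linus Torvalds", "Operating Systems", 99),
]

def to_grid(rows, students, exams):
    """Place each result in a cell indexed by (student, exam)."""
    grid = [[None] * len(exams) for _ in students]
    for student, exam, result in rows:
        grid[students.index(student)][exams.index(exam)] = result
    return grid

grid = to_grid(rows, students, exams)

# All results for one student line up along one axis...
john_total = sum(grid[students.index("John Collins")])
# ...and all results for one exam line up along the other.
programming = [row[exams.index("Programming")] for row in grid]

print(john_total, programming)
```

Each student name and exam name now appears once, as an index, and the results themselves occupy the cells.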
Figure 3
From this example, the inherent intelligence in this database structure is clear; in the relational table such views of specific data would not be possible without writing complex SQL queries.
Additional Multi-Dimensional Database Advantages
Apart from the inherent advantages of using a multi-dimensional array structure, multi-dimensional databases also offer the following advantages (Kenan):
• Enhanced Data Presentation and Navigation: Intuitive spreadsheet-like views of the data are the output of multi-dimensional databases. Such views are difficult to generate in relational systems without the use of complex SQL queries, and some, such as a top-ten list of exam results, are awkward or impossible to express in SQL dialects that lack row-limiting or ranking features.
• Ease of Maintenance: Multi-dimensional databases are very easy to maintain, because data is stored in the same way as it is viewed, that is, according to its fundamental attributes, so no additional computational overhead is required for queries of the database. Relational systems, by comparison, may use complex indexing and joins that require significant maintenance and overhead.
• Increased Performance: Multi-dimensional databases achieve performance levels that are well in excess of those of relational systems meeting similar data storage requirements. These high performance levels encourage and enable On-Line Analytical Processing (OLAP) applications. Performance can be improved in relational systems through database tuning, but the database cannot be tuned for every possible on-the-fly query. In relational systems, tuning is quite specific, which decreases flexibility, and it also requires expensive database specialists.
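On the top-ten example: many current SQL dialects can express such a ranking with ORDER BY and a row limit, although the exact syntax varies between products. The sketch below uses Python's built-in sqlite3 module and SQLite's LIMIT clause with the student results table from earlier; the table name `results` and its column names are assumptions, and three rows are requested rather than ten only because the sample has nine records.

```python
import sqlite3

rows = [
    ("John Collins", "Databases", 70),
    ("John Collins", "Programming", 72),
    ("John Collins", "Operating Systems", 60),
    ("Larry Wall", "Databases", 80),
    ("Larry Wall", "Programming", 99),
    ("Larry Wall", "Operating Systems", 70),
    ("Linus Torvalds", "Databases", 80),
    ("Linus Torvalds", "Programming", 90),
    ("Linus Torvalds", "Operating Systems", 99),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (student TEXT, exam TEXT, result INTEGER)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

# Highest results first; ties are broken by name so the order is stable.
top = conn.execute(
    "SELECT student, exam, result FROM results "
    "ORDER BY result DESC, student, exam LIMIT 3"
).fetchall()
conn.close()

for student, exam, result in top:
    print(student, exam, result)
```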
IBM for Data Warehouse
IBM positions InfoSphere Warehouse as a comprehensive data warehouse solution, providing the capabilities needed to glean maximum return from your most important investment: your information.
Data Mining
Data mining uses a relatively large amount of computing power operating on a large set of data to determine regularities and connections between data points. Algorithms that employ techniques from statistics, machine learning and pattern recognition are used to search large databases automatically. Data mining is also known as Knowledge Discovery in Databases (KDD). Like the term artificial intelligence, data mining is an umbrella term that can be applied to a number of varying activities. In the corporate world, data mining is used most frequently to determine the direction of trends and predict the future. It is employed to build models and decision support systems that give people information they can use. Data mining takes a front-line role in the battle against terrorism. It was supposedly used to determine the leader of the 9/11 attacks.
Data mining generally refers to a method used to analyze data from a target source and compose the findings into useful information. This information typically is used to help an organization cut costs in a particular area, increase revenue, or both. Often facilitated by a data mining application, its primary objective is to identify and extract patterns contained in a given data set.
Data Mining Applications
Data mining applications are computer software programs or packages that enable the extraction and identification of patterns from stored data. A data mining application, or data mining tool, is typically a software interface that interacts with a large database containing customer or other important data. Data mining is widely used by companies and public bodies for such uses as marketing, detection of fraudulent activity, and scientific research. A wide variety of data mining applications is available, particularly for business uses such as Customer Relationship Management (CRM).
These applications enable marketing managers to understand the behaviours of their customers and to predict the potential behaviour of prospective clients. An example of the kind of task that a data mining technique may assist with is the prediction of future client retention. For example, a company may decide to increase prices, and could use data mining to predict how many customers might be lost for a particular percentage increase in product price. Data mining applications are often structured around the specific needs of an industry sector or even tailored and built for a single organization. This is because the patterns within data may be very specific. Banking data mining applications may, for example, need to track client spending habits in order to detect unusual transactions that might be fraudulent. In another example, a data mining application might be used by a government body to detect associations between individuals who may be involved in terrorist activities. Pattern mining is a term sometimes used to refer to the detection of industry specific patterns in particular types of data. Using this technique, data mining association rules may be detected which can give a likelihood of one characteristic or behaviour being associated with another. An example of a data mining association rule detected by a data mining application
analyzing data for a supermarket might be, for example, the knowledge that pasta and sauce are purchased together 90% of the time. The value of data mining applications in business is often estimated to be extremely high. Some businesses have stored large amounts of data over years of operation, yet without an appropriate data mining application they are missing out on the very valuable information that may be contained within their existing data. The installation and use of data mining applications can sometimes be an investment that returns dividends quickly by enabling a business to leverage its existing information into more clients, more sales, or greater profits.
Data Mining Techniques
Most importantly, data mining techniques aim to provide insight that allows for a better understanding of data and its essential features. Companies and organizations can employ many different types of data mining techniques. While the techniques may take a similar approach, each usually serves a different goal.
The purpose of predictive data mining techniques usually is to identify statistical models or patterns that can be used to predict a response of interest. For example, a financial institution might use predictive techniques to identify which transactions have the highest probability of fraud. This is the most common data mining technique and one that has become an efficient decision-making tool for mid- to large-sized companies. It also has proven effective at predicting customer behaviour, categorizing customer segments, and forecasting various events.
Summary models rely on data mining techniques that summarize data. For instance, an organization might assign airline passengers or credit card transactions to different groups based on characteristics extracted by the analytical process. This model also can help businesses gain a deeper understanding of their customer base.
Association models take into account that certain events can occur together on a regular basis. This could be the simultaneous purchasing of items such as a mouse and keyboard or a sequence of events that led to the failure of a particular hardware device. Association models represent data mining techniques used to identify and characterize these associated occurrences. Network models use data mining techniques to reveal data structures that are in the form of nodes and links. For example, an organized fraud ring might compile a list of stolen credit card numbers, and then turn around and use them to purchase items online. In this illustration, the credit cards and online merchants represent the nodes while the actual transactions act as the links. Spam filtering is arguably a form of data mining, which automatically brings relevant messages to the surface from a chaotic sea of phishing attempts and Viagra pitches. Decision trees are used to filter mountains of data. In a decision tree, all data passes through an entrance node, where it faces a filter that separates the data into streams depending on its characteristics. For example, data about consumer behaviour is likely to be filtered based on demographic factors. Data mining is not primarily about fancy graphs and visualization techniques, but it does employ them to show what it has found. It is known that we can
absorb more statistical information visually than verbally and this format for presentation can be very persuasive and powerful if used in the right context. Data mining has many purposes and can be used for both positive and malicious gain. More organizations are coming to discover the benefits of merging data mining techniques to form hybrid models. These powerful combinations often result in applications with superior performance. By integrating the key features of different methods into single hybrid solutions, organizations usually can overcome the limitations of individual strategy systems.
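The decision-tree filtering described in this section, where all data enters at one node and is separated into streams according to its characteristics, can be sketched in a few lines of Python. This is a hand-written tree, not one learned from data; the demographic fields, thresholds, and segment names are assumptions made up for the example.

```python
def route(customer):
    """Send one customer record down the tree and return the name of its leaf."""
    # Entrance node: every record is first filtered on age.
    if customer["age"] < 30:
        # Second-level filter for the younger stream: location.
        return "young-urban" if customer["urban"] else "young-rural"
    # Second-level filter for the older stream: income.
    if customer["income"] >= 50000:
        return "older-higher-income"
    return "older-lower-income"


def split_into_streams(customers):
    """Group customer names by the leaf that each record reaches."""
    streams = {}
    for customer in customers:
        streams.setdefault(route(customer), []).append(customer["name"])
    return streams


# Hypothetical consumer records (illustrative data only).
customers = [
    {"name": "Asha", "age": 24, "urban": True, "income": 30000},
    {"name": "Ravi", "age": 45, "urban": False, "income": 80000},
    {"name": "Meena", "age": 27, "urban": False, "income": 20000},
    {"name": "Kiran", "age": 52, "urban": True, "income": 40000},
]

streams = split_into_streams(customers)
for segment, names in sorted(streams.items()):
    print(segment, names)
```

A real data mining tool would normally choose the filters and thresholds automatically from historical data; the sketch only shows how records flow through the tree once those filters exist.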
SAN: Storage Area Networks
Definition: A SAN is a dedicated network that is separate from LANs and WANs. It is generally used to connect all the storage resources attached to various servers. It consists of a collection of SAN hardware and SAN software; the hardware typically has high interconnection rates between the various storage devices, and the software manages, monitors and configures the SAN.
Introduction
The main objective of a SAN is to facilitate the exchange of data between operating systems and storage elements. Components of a SAN infrastructure include communication infrastructure, storage elements, computer systems, and a management layer. The connecting elements of a SAN include routers, gateways, hubs, switches and directors. A SAN removes restrictions on the number of servers that a storage utility can be attached to. The flexible networking of a SAN eliminates the need for physical proximity between the server and the storage devices.
Benefits of a SAN include faster transfer of data to the intended destination with minimum utilization of server capacity; access for multiple hosts to several storage devices; independent storage, which speeds up applications and offers better availability; easier, centralized management of stored data; and devices that are more amenable to scaling. SANs have led to the development of several new methods for attaching servers to storage devices such as optical jukeboxes, tape libraries, and disk arrays.
The high-speed transfer of data via a SAN can occur in the following ways:
1. Connecting servers to storage devices: This is the most commonly used method and allows servers to access a storage device either serially or simultaneously.
2. Connection between servers: A SAN enables high-volume transfer of data between servers.
3. Connection between storage devices: Useful for moving data between storage devices without eating into server capacity, which can then be utilized for other activities.
The Need for SAN
The I/O bandwidth of the networks that were earlier used to connect data storage devices and processors was not commensurate with the capacities of the disk arrays and computers that used the data stored in them. Access to data is further complicated by the different database software run on different platforms. Managing different file systems and data formats requires trained manpower. Traditionally distributed storage has been a huge drain on management resources and inefficient in terms of capacity utilization of hardware resources. Scalability is also an issue when disk capacity is tied down to a single
server or client. Sharing of data often requires creating duplicate copies; moving these copies slows down the LAN/WAN, and co-ordination between applications such as BI, CRM, and ERP that are spread over the entire organization often becomes very difficult.
SAN Infrastructure
SAN topologies are predominantly built using Fibre Channel. Fibre Channel is an open technical standard developed for networking and is especially useful for handling storage communications, as it offers flexible connectivity and fast access to data. Optical fibers are used for long-distance links, and copper cable links are preferred for shorter distances due to their lower cost. Fibre Channel can support different protocols and a large number of devices, a quality very desirable in any networking solution.
Storage
Storage devices commonly connected through a SAN include disk systems and tape systems. Disk systems offer simple integration as the I/O control is centralized. Disk systems are classified as Just a Bunch Of Disks (JBOD) and Redundant Array of Independent Disks (RAID). Disks in a JBOD are treated as individual storage devices by the applications they are connected to. A RAID is treated as a single device that has a higher fault tolerance. An array of disks can be made to behave as a JBOD or a RAID depending upon the performance requirements of a SAN. Disk systems are preferred for online data storage because of their high performance.
Tape systems make use of tapes arranged serially; parallel arrangements are not possible. Tape systems consist of drives, autoloaders, and libraries. Tape drives connect the tapes to the devices and enable reading from and writing to the tapes. Tape autoloaders are tape drives that perform automatic backup; they are used for devices that constantly generate a lot of data. Tape libraries are autonomous sets of tape drives and autoloaders. They are used in situations where the storage capacity required is very high.
Tape systems are used for offline storage because of their cost efficiency.
Benefits of SAN
One of the chief benefits of a SAN is that it simplifies the network infrastructure and makes it easier to manage. It does this by means of consolidation, virtualization, automation, and integration. Consolidation aims at centralizing storage to improve scalability, reduce infrastructure complexity, and increase efficiency. Virtualization helps improve availability and reduces costs as it offers a holistic view of storage components. Automation of routine tasks allows administrators to focus on critical tasks. Automation also improves responsiveness. Integration helps organizations furnish users with the desired information in a more systematic manner. A SAN makes information lifecycle management easier because of the integrated view of the data that it offers. Perhaps the biggest benefit of a SAN is that it complements expensive business applications that demand instant and real-time information. ERP and CRM systems can fulfil their business promise only if the right type of data is made available at the right time to the right person. To this end, a SAN is most useful and appropriate.
IBM
IBM SAN products and solutions provide integrated SMB and enterprise SAN solutions with multi-protocol local, campus, metropolitan and global storage networking.
Data Warehousing
Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database. Common accessing systems of data warehousing include queries, analysis and reporting. Because data warehousing creates one database in the end, the number of sources can be anything you want it to be, provided that the system can handle the volume, of course. The final result, however, is homogeneous data, which can be more easily manipulated. Data warehousing is commonly used by companies to analyze trends over time. In other words, companies may very well use data warehousing to view day-to-day operations, but its primary function is facilitating strategic planning resulting from long-term data overviews. From such overviews, business models, forecasts, and other reports and projections can be made. Routinely, because the data stored in data warehouses is intended to provide more overview-like reporting, the data is read-only. If you want to update the data stored via data warehousing, you will need to build a new query when you are done. This is not to say that data warehousing involves data that is never updated. On the contrary, the data stored in data warehouses is updated all the time. It is the reporting and the analysis that take more of a long-term view. Data warehousing is not the be-all and end-all for storing all of a company's data. Rather, data warehousing is used to house the necessary data for specific analysis. More comprehensive data storage requires different capacities that are more static and less easily manipulated than those used for data warehousing. Data warehousing is typically used by larger companies analyzing larger sets of data for enterprise purposes. Smaller companies wishing to analyze just one subject, for example, usually access data marts, which are much more specific, targeted in their storage, and reporting. Data warehousing often includes smaller amounts of data grouped into data marts. 
In this way, a larger company might have at its disposal both data warehousing and data marts, allowing users to choose the source and functionality depending on current needs.
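The idea of combining varied sources into one homogeneous, query-friendly store and then looking at trends over time can be illustrated with a small Python sketch. The two sources, their field names, and the figures are hypothetical; a real warehouse would load from actual operational systems into a database rather than an in-memory list.

```python
from collections import defaultdict
from datetime import datetime

# Two hypothetical sources that describe sales in different shapes.
store_sales = [
    {"date": "2023-01-15", "amount": 120.0},
    {"date": "2023-02-03", "amount": 80.0},
]
web_sales = [
    {"ordered_on": "15/01/2023", "total_cents": 5000},
    {"ordered_on": "20/02/2023", "total_cents": 2500},
]


def from_store(row):
    """Convert a store record to the common warehouse layout."""
    day = datetime.strptime(row["date"], "%Y-%m-%d")
    return {"month": day.strftime("%Y-%m"), "amount": row["amount"], "source": "store"}


def from_web(row):
    """Convert a web record (different date format and units) to the same layout."""
    day = datetime.strptime(row["ordered_on"], "%d/%m/%Y")
    return {"month": day.strftime("%Y-%m"), "amount": row["total_cents"] / 100, "source": "web"}


# The "warehouse": one homogeneous collection built from both sources.
warehouse = [from_store(r) for r in store_sales] + [from_web(r) for r in web_sales]


def monthly_totals(rows):
    """Summarize the homogeneous rows as a sales trend by month."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["month"]] += row["amount"]
    return dict(sorted(totals.items()))


totals = monthly_totals(warehouse)
print(totals)
```

Once every record has the same layout, a trend report is a simple aggregation, which is the practical benefit of the homogeneous data that the text describes.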
Types of Data Warehousing
Although you must ensure that your data warehouse fits your own unique needs, some guidelines can help you determine the probable complexity of its environment and structure. To that end, three types, or classifications, of data warehousing are described below.
1) Data Warehouse Lite: A data warehouse lite is a no-frills, bare-bones, low-tech approach to providing data that can help with some of your business decision-making. No-frills means that you put together, wherever possible, proven capabilities and tools already within your organization to build your system.
Subject Areas and Data Content
A data warehouse lite is focused on the reporting or analysis of only one or possibly two subject areas. Suppose that in your job at a wireless division of a telephone company, you analyze the sales of services such as in-network minutes, out-of-network minutes, text messaging, Internet access, and other mobile usage to consumer households. If you build a data warehouse lite exclusively for this purpose, you have all the necessary information to support your analysis and reporting for the consumer market. You do not have any information about business users and payment history, however, because that information is part of a different subject area.
Based on the subject area limitation, a data warehouse lite has just enough data content to satisfy the primary purpose of the environment, but not enough for many unstructured what-if scenarios its users might create. You must choose carefully, therefore, from among the set of all possible data elements and select a manageable subset: elements that, without a doubt, are important to have. This process is the same for any data warehouse implementation, except that you must be extremely disciplined when you are making decisions about what content to include.
Data Sources
A data warehouse lite has a limited set of data sources: typically, one to a handful. As part of an overall single-application environment, for example, the data warehouse lite acts as the restructuring agent for the application's data to make it more query- and report-friendly. The architecture of a data warehouse lite, as shown in the figure, contains these major component types:
• A single database contains the warehouse's data.
• That database is fed directly from each of the sources providing data to the warehouse.
• Users access data directly from the warehouse.
Low-tech approach to moving data into a data warehouse lite: database backup tapes or files.
The architecture of a data warehouse lite is built around straight-line movement of data.
2) Data Warehouse Deluxe: A standard data warehouse implementation that uses advanced technologies to solve complex business information and analytical needs across a broader user population. You'll most likely focus most of your data warehousing-related activities on the data warehouse deluxe environment, as shown in the figure. Data from many different sources converge in these "real" data warehouses, which make available a wealth of architectural options that you can tailor to meet your specific needs.
A data warehouse deluxe has a broader subject-area focus than a data warehouse lite.
Subject Areas and Data Content
A data warehouse deluxe contains a broad range of related subject areas: everything (or most things) that would follow a natural way of thinking about and analyzing information. In a data-warehouse-deluxe version of the telephone-company example (see the "Subject Areas and Data Content" section of the data warehouse lite), you will likely find not only the subject area of consumer wireless services (among other items), but also these elements:
• Consumer basic calling revenues and volumes
• Consumer long-distance calling revenues and volumes
• Consumer wireless calling revenues and volumes
• Business wireless services
• Business basic calling revenues and volumes
• Business long-distance calling revenues and volumes
• Business wireless calling revenues and volumes
• Internet access (DSL) services
• Internet revenues and volumes
The subject range is broader for a data warehouse deluxe than for a data warehouse lite because
• The user base is broader (more organizations have their people use the data warehouse).
• The scope of any given user's queries and reports is broader than just one or two subject areas. For example, a user might run reports comparing trends in add-on services for businesses and consumers to see where to concentrate future sales-and-marketing efforts.
When you implement a data warehouse deluxe, you usually need access capabilities (unlike with a data warehouse lite) in addition to simple results reporting. Therefore, although you might be able to use standard reports as a starting point when you're deciding what should be in your warehouse, that's rarely enough. Follow these steps to thoroughly understand your source systems:
1. Take a complete inventory of available information. This inventory is called a source systems analysis.
2. Review each candidate source element and answer these questions:
• What data do you need to include in the data warehouse and what should you leave out?
• What information should be summarized and what should be left at the detailed level?
• What data should remain in the data warehouse forever, and what data should you purge from the data warehouse after it has aged?
• What else do you need to know about the data before you put it in your data warehouse?
This step is one of the most severe tests of how well the IT people and business users get along throughout the data warehousing project.
Data Sources
You won't be lucky enough to find any single-source environments when you're building a data warehouse deluxe.
Now you have a whole new set of (I have to use the word) problems that you must deal with, including the ones in this list:
• Different encodings for similar information: For example, different sets of customer numbers come from different sources.
• Data integrity problems across multiple sources: The information in one source is different from the information in another when they should be the same.
• Different source platforms: As an example, an IBM mainframe that has DB2/MVS databases might contain the data in one of the sources, another IBM mainframe that has VSAM files might have another set of source data, a set of servers might contain data within Oracle databases, and the rest of the source data might all be stored in SQL Server databases on Windows servers.
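The first two problems in this list, differently encoded customer numbers and values that disagree between sources, can be made concrete with a short Python sketch. The source names, customer numbers, the mapping table, and the `city` field are all hypothetical and exist only for illustration.

```python
# Hypothetical cross-reference: (source system, local customer number) -> warehouse key.
KEY_MAP = {
    ("billing", "000123"): "CUST-1",
    ("crm", "C-123"): "CUST-1",
    ("crm", "C-456"): "CUST-2",
}


def to_warehouse_key(source, local_id):
    """Translate a source-specific customer number into the single warehouse key."""
    try:
        return KEY_MAP[(source, local_id)]
    except KeyError:
        raise LookupError(f"no mapping for {local_id!r} from source {source!r}") from None


def find_conflicts(records, field):
    """Return warehouse keys whose `field` value differs between source records."""
    first_seen = {}
    conflicts = set()
    for record in records:
        key = to_warehouse_key(record["source"], record["id"])
        value = record[field]
        if key in first_seen and first_seen[key] != value:
            conflicts.add(key)
        first_seen.setdefault(key, value)
    return sorted(conflicts)


# Hypothetical extracts from two source systems.
records = [
    {"source": "billing", "id": "000123", "city": "Nagpur"},
    {"source": "crm", "id": "C-123", "city": "Pune"},
    {"source": "crm", "id": "C-456", "city": "Mumbai"},
]

conflicting_keys = find_conflicts(records, "city")
print(conflicting_keys)
```

Here the billing and CRM systems refer to the same customer under different numbers and disagree about the city, so the warehouse key `CUST-1` is flagged for review before loading.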
Although the exact number of data sources depends on the specifics of your implementation, data warehouse deluxes tend to have an average of eight to ten applications and external databases that provide data to the warehouse.
Business Intelligence Tools
The broad range of subject areas and the wealth of data in a data warehouse deluxe mean that you usually have several different ways of looking at that warehouse's contents. This list shows the different ways that you can use a data warehouse:
• Simple reporting and querying: Like with a data warehouse lite, the purpose of the warehouse deluxe is to "Tell me what happened."
• Business analysis: You use the warehouse to "Tell me what happened, and why."
• Dashboards and scorecards: In this model, a variety of information is gathered from the data warehouse and made available to users who don't want to mess around with the data warehouse; they want to see snapshots of many different things. Its purpose is to "Tell me a lot of things, but don't make me work too hard to get the answers I want."
• Data mining or statistical analysis: In this area, statistical, artificial intelligence, and related techniques are used to mine through large volumes of data and provide knowledge without users even having to ask specific questions. Its purpose is to "Tell me something interesting, even though I don't know what questions to ask, and also tell me what might happen."
You're likely to employ at least three, and perhaps all four, of these types of data warehouse user-access techniques when you use a data warehouse deluxe. Although tool vendors increasingly try to provide suites of products to handle as many of these different functions as possible, you do have to deal with different products, and so does your user community.
3) Data Warehouse Supreme: A data warehouse that has large-scale data distribution and advanced technologies that can integrate various "runs the business" systems, improving the overall quality of the data assets across business information analytical needs and transactional needs. Although today's state-of-the-art data warehouse typically looks like a complicated data warehouse deluxe, if you read the following sections, you can know what the data warehouse
of tomorrow will look like. Few enterprises have ventured in this direction; because of overall cost and capabilities, data warehouse supremes are still rare.
Subject Areas and Data Content
The number of subject areas in a data warehouse supreme is unlimited because the data warehouse is virtual; it isn't all contained in a single database or even within multiple databases that you personally load and maintain. Instead, only part of your warehouse (probably a small part) is physically located on some data warehouse server; the rest is out there in cyberspace somewhere, accessible through networking capabilities as though it were all part of some physically centralized data warehouse. With a data warehouse supreme, your warehouse users have an infinite number of subject-area possibilities: anything that could possibly be of interest to them.
Think of how you use the Internet today to access Web sites all over the world, sites that someone else creates and maintains. Now, imagine that each of those sites contains information about some specific area of interest to you, rather than advertising, job ads, electronic storefronts, and whatever else you spend your time surfing the Internet trying to find. Also imagine that you can query and run reports by using the contents of one or more of these sites as your input. That's the model of the data warehouse supreme: opening up an unlimited number of possibilities to users.
Leading-edge corporations are beginning to pursue and deliver seamless convergence of different types of data: narrative documents, video, image and ordinary data (such as numbers and character information). A data warehouse supreme has all this: all the different types of data that you need to support better business decision-making. In terms of total capacity, a data warehouse supreme is huge; it surpasses today's limits.
The distribution of the information across many different platforms, much faster and higher-performance networking infrastructure, and increasingly "smarter" database management systems (in addition to, of course, steadily increasing disk storage capacities) create this capacity expansion.
Data Sources
Because of the wide breadth of subject areas in a data warehouse supreme, it has numerous data sources. The good news: Because many of the sources are external to your own warehousing environment, you aren't personally responsible for all the extraction, transformation, and loading to get them into your warehouse. The bad news: Someone has to perform those tasks, and you have little or no control over elements such as quality assurance processes or how frequently the data is refreshed. I have more good news, though: Because the most critical part of a data warehouse supreme is still internally acquired data (the data coming from your internal applications), the things you do today to make that data warehouse-ready will still be done in the future. Because you populate your data warehouse supreme with multimedia information (in addition to traditional data, such as numeric, alphabetic, and dates), the types of data sources broaden from traditional applications to video servers, web sites, and databases that store documents and text.
Business Intelligence Tools
As far as I can tell, the Big Four types of business intelligence discussed in the earlier "Business Intelligence Tools" section for the data warehouse deluxe (basic reporting and querying, business analysis, dashboards and scorecards, and data mining) are all part of the data warehouse supreme environment. Of the four, the most significant advances and improvements during the next few years probably will occur with data mining while vendors push enhancements into their products. However, these user-access methods will be relegated to providing information that will be visualized in other forms. The business intelligence tools will enable users to pull information from the data warehouse supreme and integrate it with a better visualization, for instance Google Earth or Microsoft Virtual Earth. Such combinations, known as mash-ups, are becoming more prevalent and enable users to see the data from the data warehouse supreme in more realistic forms: not columns on a report, but dots or shadings on a map.
The biggest difference between today's state-of-the-art data warehouses and the data warehouse supreme, however, is the dramatically increased use of push technology. By using intelligent agents ("assistants" you program to perform certain functions for you), you can have information fed back to you from the far ends of the Internet-based universe, not to mention your own large data warehouse servers within your own company. The figure illustrates some of the ways in which intelligent agents can help you make very efficient use of data warehousing.
Intelligent agents are an important part of the push technology architecture of a data warehouse supreme.
Database
A data warehouse supreme most likely consists of a database environment that meets these requirements:
• It's distributed across many different platforms.
• It operates in a location-transparent manner: Users make queries that access data from the appropriate platform without the users having to know the physical location (in much the same way that you access Internet Web sites by name, rather than by network address).
• It has object-oriented capabilities to store images, videos, and text in addition to the traditional data, such as numeric and date information.
• Because of dramatically faster performance than current data warehouses, it increasingly permits you to access data directly from transactional databases without having to copy the information to a separate data warehouse database.
Data Extraction, Movement and Loading
Here's how the extraction, movement, and loading of data occurs in a data warehouse supreme:
Data that's moved (copied) from a source application's database or file system into a separate database in the data warehouse is handled almost identically to how you perform those tasks in a data warehouse deluxe. The increasing use of Operational Data Stores, or ODSs (real-time availability of analytical data so that you don't have to deal with delayed access), means that more messaging occurs between your data sources and your warehouse database. The data source determines when data should be moved into the warehouse environment, so the warehouse doesn't have the responsibility to request updates and additions. When new data is inserted into the source database (or existing data is modified or deleted), the appropriate instructions and accompanying data are sent to the warehouse.
Architecture
The figure shows an example of what the architecture of a data warehouse supreme might look like. But with all the upcoming technology trends and improvements discussed in the preceding sections, your data warehouse supreme can look like (almost) anything you want.
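The source-driven movement described above, in which the source sends instructions and data whenever a row is inserted, modified, or deleted, can be sketched as a small message handler in Python. The message layout (`op`, `key`, `data`) and the customer records are assumptions made for illustration; real products use their own messaging formats.

```python
def apply_change(table, message):
    """Apply one change message sent by a data source to a warehouse table."""
    op, key = message["op"], message["key"]
    if op == "insert":
        table[key] = dict(message["data"])
    elif op == "update":
        # Only the changed fields travel with an update message.
        table.setdefault(key, {}).update(message["data"])
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


# The warehouse copy starts empty and never polls the source for changes.
warehouse_customers = {}

# Hypothetical messages pushed by the source as its own data changes.
messages = [
    {"op": "insert", "key": 1, "data": {"name": "Asha", "plan": "basic"}},
    {"op": "insert", "key": 2, "data": {"name": "Ravi", "plan": "basic"}},
    {"op": "update", "key": 1, "data": {"plan": "premium"}},
    {"op": "delete", "key": 2},
]

for message in messages:
    apply_change(warehouse_customers, message)

print(warehouse_customers)
```

The warehouse side only reacts to what it receives, which matches the point that the source, not the warehouse, decides when data moves.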
Sample architecture from a data warehouse supreme (although it can look like just about anything).
MultiDimensional Database (MDDB)
A multidimensional database (MDDB) is a type of database that is optimized for data warehouse applications. Multidimensional databases are frequently created using input from existing relational databases. Whereas a relational database is typically accessed using a Structured Query Language (SQL) query, a multidimensional database allows a user to ask questions like "How many Apples have been sold in Nagpur so far this year?" and similar questions related to summarizing business operations and trends. A multidimensional database - or a multidimensional database management system (MDDBMS) - implies the ability to rapidly process the data in the database so that answers can be generated quickly. A number of vendors provide products that use multidimensional databases. Approaches to how data is stored and the user interface vary.
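A question such as "How many Apples have been sold in Nagpur so far this year?" amounts to picking values for some dimensions and summing the matching cells. The Python sketch below shows that idea with a dictionary keyed by (product, city, year); the sales facts and function names are hypothetical, and a real MDDB would use its own storage and query interface.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, city, month, units sold).
facts = [
    ("Apples", "Nagpur", "2024-01", 120),
    ("Apples", "Nagpur", "2024-02", 80),
    ("Apples", "Pune", "2024-01", 50),
    ("Mangoes", "Nagpur", "2024-01", 30),
    ("Apples", "Nagpur", "2023-12", 40),
]


def build_cube(rows):
    """Pre-aggregate units sold into cells addressed by (product, city, year)."""
    cube = defaultdict(int)
    for product, city, month, units in rows:
        cube[(product, city, month[:4])] += units
    return cube


def query(cube, product=None, city=None, year=None):
    """Sum every cell that matches the given dimension values; None means 'all'."""
    wanted = (product, city, year)
    return sum(
        units
        for cell, units in cube.items()
        if all(w is None or w == c for w, c in zip(wanted, cell))
    )


cube = build_cube(facts)
apples_nagpur_2024 = query(cube, product="Apples", city="Nagpur", year="2024")
apples_all_cities_2024 = query(cube, product="Apples", year="2024")
print(apples_nagpur_2024, apples_all_cities_2024)
```

Leaving a dimension unspecified rolls the answer up across it, which is how the same structure answers both the city-level question and the all-cities total.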
Conceptually, a multidimensional database uses the idea of a data cube to represent the dimensions of data available to a user. For example, "sales" could be viewed in the dimensions of product model, geography, time, or some additional dimension. In this case, "sales" is known as the measure attribute of the data cube and the other dimensions are seen as feature attributes. Additionally, a database creator can define hierarchies and levels within a dimension (for example, state and city levels within a regional hierarchy).
Comparison of Relational and Multi-Dimensional Database Structures
Relational Databases
The relational database model uses a two-dimensional structure of rows and columns to store data, in tables of records corresponding to real-world entities. Tables can be linked by common key values. E.F. Codd first designed this model in 1970, while working for IBM, and its simplicity revolutionised database usage at the time. Codd's work was in many ways ahead of its time, as computing power could not support the overheads of his database system (Hasan 1999). By the 1980s the power of computers had grown to the point where these overheads were no longer a problem, and today relational database management systems (DBMS) are available on local desktops, as well as large organisational database management servers. The techniques of entity-relationship (ER) modelling and the structuring of data in normalised tables have become popular with trained database administrators and designers, who routinely use relational DBMS to store huge volumes of organisational data with very high transaction rates. Although relational databases are deceptively simple to design and operate, that simplicity falls down for the end user when it comes to running queries.
Accessing data from relational databases may require complex joins of many tables and is distinctly non-trivial for untrained end users, who may be forced to hire IT professionals to structure such queries in a query language, such as SQL. When statements that write to the database are run, such as INSERT, DELETE and ALTER TABLE, the consequences of getting them wrong are much greater on a live system.
Multi-Dimensional Databases
In a multi-dimensional database system, the data is presented to the user in such a way as to represent a hypercube, or multi-dimensional array, where each individual data value is contained within a cell accessible by multiple indexes. A simple example is given in Figure 1, where a fictional student exam result database is presented. This database contains three dimensions, namely Result, Student Name and Exam. In this example, an individual student (represented by Student Name) may have their exam results for several exams compared over a period of time, for example a four-year undergraduate course. This ability to present data in such a top-level view is unique to multi-dimensional systems, and shows just how powerful these systems can be.
Of course a multi-dimensional system is not limited to three dimensions as in the previous example, but when we go beyond that number, it becomes more difficult to present such structures in a pictorial view. If we stick with the example presented in Figure 1, let us now add a fourth dimension called Subject. Let us assume our students study computer science, with subjects in Databases, Programming and Software Engineering. If we imagine this new dimension as
being a box containing our previous three dimensions, then we would have three such boxes, namely one for each subject that our students were tested on, as shown in Figure 2.
Figure 1 A typical 3D hypercube
Of course this model can be extended to a fifth dimension, and a sixth and so on, until all requirements are met. Nevertheless, what are the advantages of such a system over a traditional relational system?
Figure 2 A four-dimensional database structure
Advantages of Multi-Dimensional Databases over Relational Databases
If we look again at our example of a student exam result dataset, there are many reasons why it is more efficient to represent our dataset with a multi-dimensional array rather than a
relational table. For example, all similar information is lined up in a single dimension, like Results, so that they can be very quickly summed up to a total or quickly compared to get an immediate idea of how student results are fairing this semester. The multi-dimensional array structure represents a higher level of organisation than the relational table. The structure itself represents a more intelligent view of the data it contains, because our perspectives of this data are embedded directly into the structure as dimensions, as opposed to being placed into fields. For example, if we were to design a fictional relational table for our student results, it might look something like the following diagram: Student Name John Collins John Collins John Collins Larry Wall Larry Wall Larry Wall Linus Torvalds Linus Torvalds Linus Torvalds Exam Databases Programming Operating Systems Databases Programming Operating Systems Databases Programming Operating Systems Result 70 72 60 80 99 70 80 90 99
The structure of this relational table tells us nothing about the nature of the contents of these fields, only that there are three fields (Student Name, Exam and Result) and nine records. If we were to present a three-dimensional view of this data, adding a third dimension called Semester, it might look something like Figure 3. As you can see from Figure 3, there is no need to have Result as a dimension, because the exam results are contained within the cells of the database structure. Another obvious advantage is the removal of the duplication in the relational table, where each student name was repeated three times, once for each exam that they sat. In the multi-dimensional view, Student Name and Exam become dimensions, or in effect indexes into the data, so duplicates no longer arise. Notice how all related information lines up neatly in the multi-dimensional view: all programming results for John Collins over all three semesters line up along the z-axis (from the diagram's perspective, the Semester dimension), while all exam results for John Collins in all subjects line up on the x-axis (the Exam dimension). Programming results for all students line up on the y-axis (the Student Name dimension).
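The way cells line up along each axis can be sketched with a small nested-list cube. This is only an illustration, not the storage layout of any real multi-dimensional database. The semester 1 values are the nine results from the relational table; the values for semesters 2 and 3 are hypothetical and were invented purely to fill the cube. All function and variable names are likewise assumptions.

```python
# Dimension members act as indexes into the cube.
STUDENTS = ["John Collins", "Larry Wall", "Linus Torvalds"]
EXAMS = ["Databases", "Programming", "Operating Systems"]
SEMESTERS = [1, 2, 3]

# CUBE[semester][student][exam] holds one exam result per cell.
CUBE = [
    [[70, 72, 60], [80, 99, 70], [80, 90, 99]],  # semester 1: from the table
    [[65, 75, 68], [82, 95, 74], [85, 88, 97]],  # semester 2: hypothetical
    [[72, 78, 71], [84, 97, 79], [83, 91, 98]],  # semester 3: hypothetical
]


def cell(student, exam, semester):
    """Look up a single result by its three dimension members."""
    return CUBE[SEMESTERS.index(semester)][STUDENTS.index(student)][EXAMS.index(exam)]


def along_semesters(student, exam):
    """One student's results in one exam across all semesters (the z-axis)."""
    return [cell(student, exam, s) for s in SEMESTERS]


def along_exams(student, semester):
    """One student's results in all exams for one semester (the x-axis)."""
    return [cell(student, e, semester) for e in EXAMS]


def along_students(exam, semester):
    """All students' results in one exam for one semester (the y-axis)."""
    return [cell(s, exam, semester) for s in STUDENTS]


john_all_exams = along_exams("John Collins", 1)
programming_all_students = along_students("Programming", 1)
```

Note that student names and exam names appear once each, as dimension members, rather than being repeated in every record.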
Figure 3
From this example, the inherent intelligence of this database structure is clear; with the relational table, such views of specific data would not be possible without writing complex SQL queries.
Additional Multi-Dimensional Database Advantages
Apart from the inherent advantages of using a multi-dimensional array structure, multi-dimensional databases also offer the following advantages (Kenan):
• Enhanced Data Presentation and Navigation: Multi-dimensional databases output intuitive, spreadsheet-like views of the data. Such views are difficult to generate in relational systems without complex SQL queries, and some, such as the top ten exam results, are awkward to express in older SQL dialects.
• Ease of Maintenance: Multi-dimensional databases are very easy to maintain, because data is stored in the same way as it is viewed, that is, according to its fundamental attributes, so no additional computational overhead is required for queries of the database. Compare this with relational systems, where complex indexing and joins may be needed that require significant maintenance and overhead.
• Increased Performance: Multi-dimensional databases achieve performance levels well in excess of those of relational systems meeting similar data storage requirements. These high performance levels encourage and enable On-Line Analytical Processing (OLAP) applications. Performance can be improved in relational systems through database tuning, but the database cannot be tuned for every possible on-the-fly query. In relational systems, tuning is quite specific, which decreases flexibility, and it also requires expensive database specialists.
IBM for Data Warehouse
IBM describes InfoSphere Warehouse as a comprehensive data warehouse solution that provides the capabilities needed to get the maximum return from your most important investment: your information.
Data Mining
Data mining uses a relatively large amount of computing power operating on a large set of data to determine regularities and connections between data points. Algorithms that employ techniques from statistics, machine learning and pattern recognition are used to search large databases automatically. Data mining is also known as Knowledge Discovery in Databases (KDD). Like the term artificial intelligence, data mining is an umbrella term that can be applied to a number of varying activities. In the corporate world, data mining is used most frequently to determine the direction of trends and predict the future. It is employed to build models and decision support systems that give people information they can use. Data mining also plays a front-line role in counter-terrorism; it was supposedly used to determine the leader of the 9/11 attacks.
Data mining generally refers to a method used to analyze data from a target source and turn the findings into useful information. This information typically is used to help an organization cut costs in a particular area, increase revenue, or both. Often facilitated by a data mining application, its primary objective is to identify and extract patterns contained in a given data set.
Data Mining Applications
Data mining applications are computer software programs or packages that enable the extraction and identification of patterns from stored data. A data mining application, or data mining tool, is typically a software interface that interacts with a large database containing customer or other important data. Data mining is widely used by companies and public bodies for purposes such as marketing, detection of fraudulent activity, and scientific research. There is a wide variety of data mining applications available, particularly for business uses such as Customer Relationship Management (CRM).
These applications enable marketing managers to understand the behaviours of their customers and to predict the potential behaviour of prospective clients. An example of the kind of task that a data mining technique may assist with is the prediction of future client retention. For example, a company may decide to increase prices, and could use data mining to predict how many customers might be lost for a particular percentage increase in product price. Data mining applications are often structured around the specific needs of an industry sector or even tailored and built for a single organization. This is because the patterns within data may be very specific. Banking data mining applications may, for example, need to track client spending habits in order to detect unusual transactions that might be fraudulent. In another example, a data mining application might be used by a government body to detect associations between individuals who may be involved in terrorist activities. Pattern mining is a term sometimes used to refer to the detection of industry specific patterns in particular types of data. Using this technique, data mining association rules may be detected which can give a likelihood of one characteristic or behaviour being associated with another. An example of a data mining association rule detected by a data mining application
analyzing data for a supermarket might be, for example, the knowledge that pasta and sauce are purchased together 90% of the time. The value of data mining applications in business is often estimated to be extremely high. Some businesses have stored large amounts of data over years of operation, yet without an appropriate data mining application are missing out on the very valuable information that may be contained within their existing data. The installation and use of data mining applications can sometimes be an investment that returns dividends quickly by enabling a business to leverage its existing information into more clients, more sales, or greater profits.
Data Mining Techniques
Most importantly, data mining techniques aim to provide insight that allows for a better understanding of data and its essential features. Companies and organizations can employ many different types of data mining techniques. While they may take a similar approach, each usually strives to meet a different goal.
The purpose of predictive data mining techniques usually is to identify statistical models or patterns that can be used to predict a response of interest. For example, a financial institution might use them to identify which transactions have the highest probability of fraud. This is the most common data mining technique and one that has become an efficient decision-making tool for mid- to large-sized companies. It also has proven effective at predicting customer behaviour, categorizing customer segments, and forecasting various events.
Summary models use data mining techniques to summarize data, typically by grouping similar records. For instance, an organization might assign airline passengers or credit card transactions to different groups based on characteristics extracted during the analytical process. This model also can help businesses gain a deeper understanding of their customer base.
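The supermarket association rule mentioned earlier (pasta and sauce bought together) can be quantified with two standard measures, support and confidence. The sketch below reads "90% of the time" as the confidence of the rule pasta → sauce, which is an interpretation rather than something the text states. The basket data is hypothetical and was constructed so that nine of the ten pasta baskets also contain sauce.

```python
def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)


def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    return count_containing(baskets, items) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    with_antecedent = count_containing(baskets, antecedent)
    with_both = count_containing(baskets, set(antecedent) | set(consequent))
    return with_both / with_antecedent


# Hypothetical purchase data: 15 baskets, 10 of which contain pasta.
BASKETS = (
    [{"pasta", "sauce"}] * 9
    + [{"pasta", "bread"}]
    + [{"bread", "milk"}] * 5
)

rule_confidence = confidence(BASKETS, {"pasta"}, {"sauce"})
rule_support = support(BASKETS, {"pasta", "sauce"})
```

Here the rule holds for nine of the ten pasta baskets, while pasta and sauce together appear in nine of the fifteen baskets overall. A real data mining application would search for such rules across all item combinations rather than test a single rule chosen in advance.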
Association models take into account that certain events can occur together on a regular basis. This could be the simultaneous purchasing of items such as a mouse and keyboard or a sequence of events that led to the failure of a particular hardware device. Association models represent data mining techniques used to identify and characterize these associated occurrences. Network models use data mining techniques to reveal data structures that are in the form of nodes and links. For example, an organized fraud ring might compile a list of stolen credit card numbers, and then turn around and use them to purchase items online. In this illustration, the credit cards and online merchants represent the nodes while the actual transactions act as the links. Spam filtering is arguably a form of data mining, which automatically brings relevant messages to the surface from a chaotic sea of phishing attempts and Viagra pitches. Decision trees are used to filter mountains of data. In a decision tree, all data passes through an entrance node, where it faces a filter that separates the data into streams depending on its characteristics. For example, data about consumer behaviour is likely to be filtered based on demographic factors. Data mining is not primarily about fancy graphs and visualization techniques, but it does employ them to show what it has found. It is known that we can
absorb more statistical information visually than verbally and this format for presentation can be very persuasive and powerful if used in the right context. Data mining has many purposes and can be used for both beneficial and malicious ends. More organizations are coming to discover the benefits of combining data mining techniques to form hybrid models. These powerful combinations often result in applications with superior performance. By integrating the key features of different methods into a single hybrid solution, organizations can usually overcome the limitations of individual techniques.
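The decision-tree filtering described earlier, where all data enters at one node and is separated into streams by its characteristics, can be sketched as follows. The tree here is built by hand for illustration; in practice a data mining tool would learn the splits from the data. The demographic attributes (age, region), the group labels and the sample consumers are all hypothetical.

```python
from collections import defaultdict


class Node:
    """A decision-tree node: either an internal test or a leaf with a label."""

    def __init__(self, test=None, branches=None, label=None):
        self.test = test          # function: record -> branch key
        self.branches = branches  # dict: branch key -> child Node
        self.label = label        # set only on leaf nodes


def route(node, record):
    """Pass one record from the entrance node down to a leaf; return its label."""
    while node.label is None:
        node = node.branches[node.test(record)]
    return node.label


def split_into_streams(root, records):
    """Group record names by the leaf that each record reaches."""
    streams = defaultdict(list)
    for record in records:
        streams[route(root, record)].append(record["name"])
    return dict(streams)


# Hypothetical hand-built tree: split on age, then on region for younger consumers.
TREE = Node(
    test=lambda r: "under_30" if r["age"] < 30 else "30_plus",
    branches={
        "under_30": Node(
            test=lambda r: r["region"],
            branches={
                "urban": Node(label="young urban"),
                "rural": Node(label="young rural"),
            },
        ),
        "30_plus": Node(label="30 and over"),
    },
)

CONSUMERS = [
    {"name": "A", "age": 22, "region": "urban"},
    {"name": "B", "age": 45, "region": "rural"},
    {"name": "C", "age": 27, "region": "rural"},
    {"name": "D", "age": 31, "region": "urban"},
]

streams = split_into_streams(TREE, CONSUMERS)
```

Each consumer record enters at the root (the entrance node) and follows exactly one path, so the streams never overlap.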
SAN: Storage Area Networks
Definition: A SAN is a dedicated network that is separate from LANs and WANs. It is generally used to interconnect the storage resources attached to various servers. It consists of SAN hardware and SAN software: the hardware typically provides high interconnection rates between the various storage devices, and the software manages, monitors and configures the SAN.
Introduction
The main objective of a SAN is to facilitate the exchange of data between operating systems and storage elements. Components of a SAN infrastructure include communication infrastructure, storage elements, computer systems, and a management layer. The connecting elements of a SAN include routers, gateways, hubs, switches and directors. A SAN removes restrictions on the number of servers that a storage utility can be attached to. The flexible networking of a SAN eliminates the need for physical proximity between the server and the storage devices. Benefits of a SAN include faster transfer of data to the intended destination with minimal use of server capacity, access for multiple hosts to several storage devices, faster applications and better availability through independent storage, easier and centralized management of stored data, and better scalability of devices. SANs have led to the development of several new methods for attaching servers to storage devices such as optical jukeboxes, tape libraries, and disk arrays.
The high-speed transfer of data via a SAN can occur in the following ways:
1. Connecting servers to storage devices – This is the most commonly used method and allows servers to access a storage device either serially or simultaneously.
2. Connection between servers – A SAN enables high-volume transfer of data between servers.
3. Connection between storage devices – Useful for moving data between storage devices without eating into server capacity, which can then be used for other activities.
The Need for SAN
The I/O bandwidth of the networks that were earlier used to connect data storage devices and processors was not commensurate with the capacities of the disk arrays and computers that used the data stored in them. Access to data is further complicated by different database software running on different platforms. Managing different file systems and data formats requires trained manpower. Traditional distributed storage has been a huge drain on management resources and inefficient in its use of hardware capacity. Scalability is also an issue when disk capacity is tied down to a single
server or client. Sharing data often requires creating duplicate copies; moving these copies slows down the LAN/WAN, and coordination between applications such as BI, CRM, and ERP that are spread over the entire organization often becomes very difficult.
SAN Infrastructure
SAN topologies are predominantly built using Fibre Channel. Fibre Channel is an open technical standard developed for networking and is especially useful for handling storage communications as it offers flexible connectivity and fast access to data. Optical fibers are used for long-distance links, and copper cable links are preferred for shorter distances due to their lower cost. Fibre Channel can support different protocols and a large number of devices, a quality very desirable in any networking solution.
Storage
Storage devices commonly connected through a SAN include disk systems and tape systems. Disk systems offer simple integration as the I/O control is centralized. Disk systems are classified as Just A Bunch Of Disks (JBOD) or Redundant Array of Independent Disks (RAID). Disks in a JBOD are treated as individual storage devices by the applications they are connected to. A RAID is treated as a single device that has a higher fault tolerance. An array of disks can be made to behave as a JBOD or a RAID depending upon the performance requirements of a SAN. Disk systems are preferred for online data storage because of their high performance. Tape systems make use of tapes arranged serially; parallel arrangements are not possible. Tape systems consist of drives, autoloaders, and libraries. Tape drives connect the tapes to the devices and enable reading from and writing to the tapes. Tape autoloaders are tape drives that perform automatic backup; they are used for devices that constantly generate a lot of data. Tape libraries are autonomous sets of tape drives and autoloaders. They are used in situations where the storage capacity required is very high.
Tape systems are used for offline storage because of their cost efficiency.
Benefits of SAN
One of the chief benefits of a SAN is that it simplifies the network infrastructure and makes it easier to manage. It does this by means of consolidation, virtualization, automation, and integration. Consolidation aims at centralizing the storage to improve scalability, reduce infrastructure complexity, and increase efficiency. Virtualization helps improve availability and reduces costs as it offers a holistic view of storage components. Automation of routine tasks allows administrators to focus on critical tasks. Automation also improves responsiveness. Integration helps organizations furnish users with the desired information in a more systematic manner. A SAN makes information lifecycle management easier because of the integrated view of the data that it offers. Perhaps the biggest benefit of a SAN is that it complements expensive business applications that demand instant and real-time information. ERP and CRM systems can fulfil their business promise only if the right type of data is made available at the right time to the right person. To this end, a SAN is most useful and appropriate.
IBM
IBM SAN products and solutions provide integrated SMB and enterprise SAN solutions with multi-protocol local, campus, metropolitan and global storage networking.