
Data Warehouse and Data Mining Exam Translation
Data Mining Exam Manual
By Ronel Li, Zhangwei Tuo, Jia Li, Info Dept

I. Fill in the blanks (20%)

1. From a data analysis point of view, data mining can be classified into two categories: ( predictive ) and ( descriptive ) data mining.
2. The data model in a data warehouse comprises the conceptual model, the ( logical model ), the ( physical model ), and the metadata model.
3. From the user's point of view, metadata can be divided into ( business ) and ( technical ) metadata.
4. The OLAP operations in the multidimensional data model are ( slice ), dice, ( drill-down ), ( roll-up ), and rotate.
5. Methods of data preprocessing include ( data cleaning, data integration and transformation, and data reduction ).
6. In a data warehouse, data can be organized as ( current detail ) data, older detail data, ( lightly summarized ) data, and highly summarized data.
7. A three-tier data warehousing architecture includes a warehouse database server, an ( OLAP server ), and a client.
8. ( Data discretization ) techniques can be used to reduce the number of values for a given continuous attribute, by dividing the range of the attribute into intervals.
9. ( A concept hierarchy ) defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
10. ( A data cube ) allows data to be modeled and viewed in multiple dimensions; it is defined by dimensions and facts.
11. In general, the major clustering methods can be classified into the following categories: partitioning methods, hierarchical methods, ( density-based methods ), ( grid-based methods ), and model-based methods.

II. True-false questions (30%)

Please try your own luck ...

III. Short Answer Questions (20%)

1. How is noisy data smoothed with binning during preprocessing?
Answer: Binning methods smooth a sorted data value by consulting the values around it. Binning techniques include:
1) Smoothing by bin means
2) Smoothing by bin medians
3) Smoothing by bin boundaries
A sketch of two of these techniques follows.
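Below is a minimal Python sketch of smoothing by bin means and by bin boundaries. The function names and the toy prices list are our own additions (the values echo the textbook's classic sorted-price example), and equal-frequency bins of depth 3 are assumed:

```python
def smooth_by_bin_means(values, depth):
    """Sort the data, split it into equal-frequency bins of the given
    depth, and replace every value by its bin's mean."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

def smooth_by_bin_boundaries(values, depth):
    """Replace every value by whichever bin boundary (min or max of
    its bin) is closer to it."""
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        lo, hi = bin_[0], bin_[-1]
        smoothed.extend([lo if v - lo <= hi - v else hi for v in bin_])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # toy sorted data, depth-3 bins
print(smooth_by_bin_means(prices, 3))          # [9.0, 9.0, 9.0, 22.0, ...]
print(smooth_by_bin_boundaries(prices, 3))     # [4, 4, 15, 21, 21, 24, ...]
```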
2. State the idea and basic steps of the k-means algorithm.
Answer: The basic structure of the k-means algorithm is:
1) Arbitrarily choose k objects from D as the initial cluster centers.
2) Repeat:
   a) (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster.
   b) Update the cluster means, i.e., calculate the mean value of the objects for each cluster.
3) Until no change.
A runnable sketch of this loop follows.
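The following NumPy sketch implements the loop above. Random initialization, Euclidean distance, and the toy points are our assumptions, not part of the exam answer, and the sketch assumes no cluster ever empties out:

```python
import numpy as np

def k_means(D, k, rng=np.random.default_rng(0)):
    # arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    while True:
        # (re)assign each object to the most similar cluster, i.e. the
        # one whose mean (center) is nearest in Euclidean distance
        dist = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update the cluster means (assumes every cluster stays non-empty)
        new_centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # until no change
            return labels, centers
        centers = new_centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels, centers = k_means(X, k=2)
print(labels)    # two well-separated clusters: e.g. [0 0 1 1]
```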
3. How can empty (missing) attribute values be filled in?
Answer: The methods of filling in missing attribute values are:
1) Ignore the tuple.
2) Fill in the missing value manually.
3) Use a global constant to fill in the missing value.
4) Use the attribute mean to fill in the missing value.
5) Use the attribute mean of all samples belonging to the same class as the given tuple.
6) Use the most probable value to fill in the missing value.
A sketch of methods 3-5 follows.
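A short pandas sketch of methods 3-5; the DataFrame, its column names ("class", "income"), and the global constant -1 are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"class": ["A", "A", "B", "B"],
                   "income": [30.0, None, 50.0, None]})

# method 3: fill with a global constant
df["global_const"] = df["income"].fillna(-1)

# method 4: fill with the overall attribute mean
df["overall_mean"] = df["income"].fillna(df["income"].mean())

# method 5: fill with the attribute mean of the tuple's own class
df["class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)
```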
IV. Calculation (20%) (Simple Instructions Here)

1. Given a minimum support count, find the frequent itemsets using the Apriori association-rule algorithm. The level-wise procedure is: scan D for the count of each candidate in Ck; compare each candidate's support count with the minimum support count to obtain the frequent set Lk; then generate the Ck+1 candidates from Lk; repeat until no further frequent itemsets are found. (The worked trace in the original, over items I1-I5, ends with the frequent 3-itemsets {I1, I2, I3} and {I1, I2, I5}, each with support count 2.)
English textbook page: 233
Chinese textbook page: 148
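A compact Python sketch of the level-wise search described above. The candidate-pruning step of full Apriori (discarding candidates with an infrequent (k-1)-subset) is omitted for brevity, and the nine-transaction database below is our reconstruction of the textbook-style I1-I5 example, not copied from the exam:

```python
def apriori(transactions, min_sup):
    """Return all itemsets whose support count is >= min_sup."""
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent = set(Lk)
    k = 2
    while Lk:
        # generate Ck candidates by joining L(k-1) with itself
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # scan D for the count of each candidate and compare with min_sup
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
# with min_sup=2 the largest frequent sets are {I1,I2,I3} and {I1,I2,I5}
print(sorted(map(sorted, apriori(D, min_sup=2))))
```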
2. Calculate the information gain of an attribute using the decision tree algorithm (a specific dataset would be given). The formulas are:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Gain(A) = Info(D) - Info_A(D)

English textbook page: 192
Chinese textbook page: 287
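A small Python sketch of these formulas; the toy "outlook"-style dataset is hypothetical, not the exam's dataset:

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D), partitioning D by attribute A."""
    n = len(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    # Info_A(D) = sum of |Dj|/|D| * Info(Dj) over the partitions
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "no"]
print(gain(rows, labels, 0))   # 0.811 - 0.5 = 0.311 bits
```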
3. How is the dissimilarity of two objects measured in clustering? How is the distance over binary attributes calculated?

Let q be the number of attributes that equal 1 for both objects i and j, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both. Then:

Symmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
Asymmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s)

Worked example (asymmetric, from the textbook's patient-record data):
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
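A minimal Python sketch of both coefficients. The Jack/Mary/Jim bit vectors are our reconstruction of textbook-style patient records (1 = positive/present, 0 = negative/absent), chosen so the asymmetric distances reproduce 0.33, 0.67, and 0.75:

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity of two binary vectors via the q/r/s/t counts."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))  # both 1
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))  # 1 in x, 0 in y
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))  # 0 in x, 1 in y
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))  # both 0
    denom = q + r + s + t if symmetric else q + r + s  # drop t if asymmetric
    return (r + s) / denom

jack = [1, 0, 1, 0, 0, 0]   # hypothetical test results, 1 = positive
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissimilarity(jack, mary, symmetric=False), 2))  # 0.33
print(round(binary_dissimilarity(jack, jim,  symmetric=False), 2))  # 0.67
print(round(binary_dissimilarity(jim,  mary, symmetric=False), 2))  # 0.75
```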








