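Title: 太阳望远镜海量数据存储关键技术研究
Author: Liu Yingbo (刘应波)
Degree: Doctoral
Defense date: 2015-07-01
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Advisor: Wang Feng (王锋)
Keywords: massive solar observation data; high-speed distributed storage; data consistency; massive data retrieval
Alternative title: Research on the Key Technologies of Massive Data Storage for Solar Telescope
Degree discipline: Astronomical Techniques and Methods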
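Chinese abstract (translated): Astronomical data processing has entered the data-intensive era of astroinformatics, and big data is one of its defining features. In solar observation this shows up as very large data volumes, high acquisition rates, and continuous data growth. Traditional local storage such as DAS, and network storage such as NAS and SAN, show many limitations when faced with the storage, processing, and management needs of astronomical big data, and these limitations hamper many research activities. Modern astronomical observation built on massive data urgently needs advanced big-data processing techniques such as MapReduce to speed up processing. To support them, the storage system must offer high-performance, scalable concurrent reads and writes and be able to manage massive astronomical data.

The 1m New Vacuum Solar Telescope (NVST) is in operation. It acquires data in a high-speed, multi-channel, multi-terminal mode and has already produced more than 200 TB of solar observation data. Under ideal observing conditions the photospheric and chromospheric channels observe simultaneously. The chromospheric and photospheric channels currently reach acquisition rates of 60 GB and 190 GB per hour respectively, so an 8-hour observing day produces about 2 TB of data. As the NVST high-resolution imaging system demands higher temporal and spatial resolution, and as more channels work concurrently in the future, the one-way write rate could reach the order of TB per second. If real-time data processing is taken into account, this rate doubles. At such rates, single-machine hard-disk storage can hardly sustain NVST's continuous, high-speed writes. Some current mainstream storage technologies, such as solid-state drives, are restricted in solar observation by cost and limited write endurance, which greatly limits NVST's scientific output.

In addition, traditional storage technologies such as the local file systems Ext3 and Ext4 and the newer ZFS can hardly meet the high-speed concurrent read/write demands of solar observation, and data management based on relational databases does not cope well with NVST's massive data. These problems call for a storage technology that can manage massive data, offers high performance and high scalability, adapts to NVST's changing storage needs, and supports high-speed data processing. Some advanced technologies, such as storage consolidation based on DAS and SAN and storage virtualization, can meet these needs, but they are technically complex and costly to deploy, configure, and maintain, so they are not well suited to solar observation. Distributed parallel storage meets these needs well: it provides high-performance concurrent storage, scales out well, and can be deployed on ordinary low-cost hosts. Considering cost, performance, and scalable management together, distributed storage is well suited to massive data storage for NVST's multi-channel, multi-band observing mode. Efficient retrieval and querying of massive data is another key problem in storage management, and distributed non-relational (NoSQL) data management can address it effectively. This dissertation therefore takes distributed storage as its core and studies the application of distributed file systems and NoSQL-based retrieval of massive data in solar observation. The main work is as follows:

1) Application of distributed file systems in solar observation. Simulation experiments examine, both horizontally and vertically, the storage performance and scalability of distributed file systems and the feasibility of applying them in solar observation. Storage performance optimization for FITS files is studied; with bonding, a single process reaches a storage rate of 3.4 Gb/s in a gigabit network environment, which meets NVST's current high-speed storage needs. The work focuses on the application modes of distributed file systems in solar observation and on how to meet the data storage needs of heterogeneous platforms.

2) Inconsistency between solar FITS metadata and data in distributed storage. In a distributed storage environment, the FITS metadata and data of an observation are stored separately to support efficient querying and management. Brief network or disk failures can then cause large numbers of inconsistencies between metadata and data. How to enforce consistency between the two is an easily overlooked problem in data storage. The dissertation analyzes the causes of inconsistency, inconsistency models, and countermeasures, and proposes using the two-phase commit protocol to keep metadata and data consistent as far as possible.

3) Design of AstroFS, a distributed storage system for solar observation, with a description of its core components. Its high-performance design features include abandoning a multi-level directory tree in favor of a flat two-level directory layout for observation files, in line with the requirements of solar observation, and a network-based RAID0 data striping technique. Other key techniques in the system, such as data aggregation and splitting and balanced data distribution, are also analyzed and designed in detail.

4) A formal description of the storage and query patterns for FITS files in NoSQL, with a compressed word-aligned bitmap index algorithm used to index massive astronomical data. An astronomical observation data archiving system based on Fastbit is designed and implemented; it offers efficient indexing and retrieval.

The distributed storage and massive data retrieval techniques studied in the dissertation meet NVST's needs for fast data storage and efficient access and are of strong practical value. The research methods also provide a reference for the storage design and massive data retrieval of similar solar telescopes in China and abroad, and have some value for wider application.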
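The two-phase commit idea proposed in item 2) can be illustrated with a small in-memory sketch. This is not the dissertation's implementation: the `Participant` class, the `two_phase_write` function, and the example key and header value are hypothetical, and a real deployment would also need durable logs, timeouts, and coordinator recovery.

```python
class Participant:
    """Hypothetical in-memory store that takes part in two-phase commit."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare  # simulate a network/disk fault
        self.staged = {}      # txid -> (key, value) awaiting a decision
        self.committed = {}   # key -> value, visible to readers

    def prepare(self, txid, key, value):
        """Phase 1: stage the write and vote. False means 'abort'."""
        if self.fail_on_prepare:
            return False
        self.staged[txid] = (key, value)
        return True

    def commit(self, txid):
        """Phase 2: make the staged write visible."""
        key, value = self.staged.pop(txid)
        self.committed[key] = value

    def abort(self, txid):
        """Phase 2: discard the staged write, if any."""
        self.staged.pop(txid, None)


def two_phase_write(txid, writes):
    """Apply every write or none. `writes` holds (participant, key, value)."""
    votes = [p.prepare(txid, key, value) for p, key, value in writes]
    if all(votes):
        for p, _, _ in writes:
            p.commit(txid)
        return True
    for p, _, _ in writes:
        p.abort(txid)
    return False


# Metadata and pixel data live in separate stores, as in the abstract.
metadata_store = Participant("metadata")
data_store = Participant("data")
ok = two_phase_write("tx1", [
    (metadata_store, "frame0001.fits", {"DATE-OBS": "2015-07-01"}),
    (data_store, "frame0001.fits", b"pixel bytes"),
])
```

If either store votes no during the prepare phase, neither write becomes visible, which is the property that keeps the metadata record and the data file from drifting apart.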
English abstract: Astronomical data processing has entered the era of data-intensive astroinformatics. In solar observation, big data is characterized by large data volumes, high acquisition rates, and continuous data growth. Traditional local storage such as DAS, and network storage such as NAS and SAN, show many limitations in storing, processing, and managing astronomical big data, which slows scientific research. Modern astronomical observation needs advanced big-data technologies to accelerate data processing. The storage system behind these technologies has to provide high-performance, scalable parallel reads and writes and efficient data indexing and querying, and it must cope with the fast growth of observation data.

The New Vacuum Solar Telescope (NVST) has begun routine observation and has produced over 200 TB of solar observation data in a high-speed, multi-channel, multi-wavelength mode. When the photospheric and chromospheric channels observe simultaneously under good observing conditions, the chromospheric channel reaches 60 GB per hour and the photospheric channel 190 GB per hour, so about 2 TB of data is produced in 8 hours of continuous observation. With NVST's requirements for higher temporal and spatial resolution, and with more channels working in parallel in the future, the one-way write rate could reach the order of TB per second. If real-time data processing is taken into account, the rate doubles. Although some storage technologies offer good performance and extensibility, the continuous-write character of the data ultimately limits the use of these mainstream technologies.
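The daily figure of about 2 TB follows directly from the two channel rates quoted above. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 1000 GB) and that both channels record for the full 8 hours; the function and constant names are illustrative only:

```python
# Per-channel acquisition rates quoted in the abstract (GB per hour).
CHROMOSPHERE_GB_PER_HOUR = 60
PHOTOSPHERE_GB_PER_HOUR = 190
OBSERVING_HOURS = 8  # observing time per day used in the abstract


def daily_volume_gb(rates_gb_per_hour, hours):
    """Total volume in GB when all channels record for `hours` hours."""
    return sum(rates_gb_per_hour) * hours


def sustained_rate_mb_per_s(rates_gb_per_hour):
    """Average combined write rate in MB/s (assumes 1 GB = 1000 MB)."""
    return sum(rates_gb_per_hour) * 1000 / 3600


rates = [CHROMOSPHERE_GB_PER_HOUR, PHOTOSPHERE_GB_PER_HOUR]
total_gb = daily_volume_gb(rates, OBSERVING_HOURS)
print(f"{total_gb} GB per day, about {total_gb / 1000:.1f} TB")
print(f"{sustained_rate_mb_per_s(rates):.1f} MB/s sustained")
```

The two channels together average roughly 69 MB/s, which a single disk can handle; the TB-per-second rates projected for future multi-channel operation are what motivate the move to distributed storage.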
Traditional local file systems such as Ext3, Ext4, and ZFS can hardly satisfy the requirements of NVST, so we need a storage technology that can manage massive data, offers high performance and high extensibility, adapts to the future storage needs of NVST, and supports high-speed processing of massive data. As devices such as larger telescopes come into use, the storage system needs suitable technologies to support high-speed storing, reading, and processing of massive data. Distributed parallel storage satisfies these needs well, because a distributed architecture provides high-performance parallel storage and can scale out, which suits the multi-channel, multi-waveband, high-speed, continuously growing data of telescopes like NVST. This dissertation mainly studies key techniques of distributed storage. A NoSQL-based bitmap index is also studied to meet the needs of indexing and retrieving massive data. The research covers the following aspects:
1) Applying distributed storage to solar observation. We use experiments to verify the feasibility, performance, and extensibility of distributed storage. Using bonding, we achieve a data acquisition rate of 3.4 Gb/s in a 1 Gb network environment.
2) High-speed data storage may lead to inconsistency between metadata and data that are stored separately. How to keep metadata and data consistent is an often-ignored issue in data storage. This dissertation analyzes the causes, states, and models of the inconsistency and adopts the two-phase commit (2PC) protocol to ensure consistency.
3) We design a distributed storage system called AstroFS, based on a RAID0 mechanism over the network, to achieve high performance. Key techniques such as data aggregation and splitting algorithms and data balancing strategies are worked out.
4) We use a compressed word-aligned bitmap index to index massive solar data. We also design and implement an astronomical data archiving system (abbreviated DAS here, not to be confused with direct-attached storage) based on Fastbit. Compared with techniques based on relational databases, this system offers advantages such as more efficient retrieval and faster index building.
The distributed storage and massive data retrieval technologies studied in this dissertation satisfy NVST's requirements for data storage and management. The research methods also provide a reference for the design of massive data storage and data retrieval applications for large solar telescopes at home and abroad.
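As an illustration of the RAID0-style striping over the network described in item 3), the sketch below deals fixed-size units of a file round-robin across storage nodes and reads them back in the same order. The names `stripe` and `reassemble`, the stripe size, and the node count are assumptions for illustration; the actual AstroFS layout is not specified in this abstract.

```python
def stripe(payload: bytes, node_count: int, stripe_size: int):
    """Split payload into stripe_size units and deal them round-robin."""
    if node_count < 1 or stripe_size < 1:
        raise ValueError("node_count and stripe_size must be positive")
    nodes = [[] for _ in range(node_count)]
    for index, offset in enumerate(range(0, len(payload), stripe_size)):
        unit = payload[offset:offset + stripe_size]
        nodes[index % node_count].append(unit)
    return nodes


def reassemble(nodes):
    """Inverse of stripe: read units back in round-robin order."""
    units = []
    longest = max((len(node) for node in nodes), default=0)
    for row in range(longest):
        for node in nodes:
            if row < len(node):
                units.append(node[row])
    return b"".join(units)


# Lists stand in for storage nodes; a real system would send each unit
# over the network so that the nodes write in parallel.
frame = bytes(range(10))
layout = stripe(frame, node_count=3, stripe_size=3)
restored = reassemble(layout)
```

Because consecutive units land on different nodes, a large sequential write is spread over several disks and network links at once. As with RAID0 on local disks, striping alone adds no redundancy.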
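Language: Chinese
Subject: Astronomy
Pages: 145
Content type: Dissertation
Source URL: [http://ir.ynao.ac.cn/handle/114a53/4382]
Collection: Yunnan Observatories_Other
Author affiliation: Yunnan Observatories, Chinese Academy of Sciences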
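Recommended citation
GB/T 7714
Liu Yingbo. 太阳望远镜海量数据存储关键技术研究 [Research on the Key Technologies of Massive Data Storage for Solar Telescope][D]. Beijing. Graduate School of the Chinese Academy of Sciences. 2015.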