TY - JOUR
T1 - Box queries over multi-dimensional streams
AU - Friedman, Roy
AU - Shahout, Rana
N1 - Publisher Copyright: © 2022 Elsevier Ltd
PY - 2022/11
Y1 - 2022/11
AB - Answering statistical queries about streams of data arriving online is becoming increasingly important. Often, such data includes multiple attributes, so data elements can be viewed as points in a multi-dimensional universe. This paper extends existing work on streaming algorithms by studying the ability to perform box queries on online multi-dimensional data streams. We develop three algorithms, C-DARQ, DARQ, and MARQ, that support such queries for a large number of statistical functions, including (but not limited to) counting, frequency estimation, and heavy hitters. We also apply our algorithms in distributed settings, in which measurements are recorded independently by multiple sites (e.g., multiple routers) and the goal is to obtain a global network analysis. The protocols are analyzed and evaluated over a synthetic dataset, a Chicago dataset, and a Facebook dataset from Kaggle, in multiple dimensions (up to 10). Our algorithms asymptotically improve the space bounds, as well as the update and query performance, of existing works. Unlike known approaches, our algorithms can also be used to solve a larger class of problems beyond counting. We further discuss extending our work to the sliding window model and to the case where the dimensions’ bounds are unknown a priori.
UR - http://www.scopus.com/inward/record.url?scp=85132831546&partnerID=8YFLogxK
U2 - 10.1016/j.is.2022.102086
DO - 10.1016/j.is.2022.102086
M3 - Article
AN - SCOPUS:85132831546
SN - 0306-4379
VL - 109
JO - Information Systems
JF - Information Systems
M1 - 102086
ER -