Generalized Unique Reconstruction From Substrings

Yonatan Yehezkeally; Daniella Bar-Lev; Sagi Marcovich; Eitan Yaakobi

doi:10.1109/TIT.2023.3269124

Computer Science

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which all substrings of pre-defined lengths are read or substrings are read with no overlap for the single string case, this work studies two extensions of this paradigm. The first extension considers the setup in which consecutive substrings are read with some given minimum overlap. First, an upper bound is provided on the attainable rates of codes that guarantee unique reconstruction. Then, efficient constructions of codes that asymptotically meet that upper bound are presented. In the second extension, we study the setup where multiple strings are reconstructed together. Given the number of strings and their length, we first derive a lower bound on the read substrings' length ℓ that is necessary for the existence of multi-strand reconstruction codes with non-vanishing rates. We then present two constructions of such codes and show that their rates approach 1 for values of ℓ that asymptotically behave like the lower bound.
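The read model described in the abstract can be illustrated with a minimal sketch. It is not the paper's construction. It assumes reads are length-ℓ substrings taken at chosen start positions, and the helper names (`all_substrings`, `reads_with_overlap`, `min_overlap`) are hypothetical; the paper's exact model may differ. The sketch also shows why coding is needed at all: two distinct strings can share the same multiset of length-ℓ substrings.

```python
from collections import Counter


def all_substrings(x, ell):
    """Multiset of every length-ell substring of x (the 'all substrings' extreme)."""
    return Counter(x[i:i + ell] for i in range(len(x) - ell + 1))


def reads_with_overlap(x, ell, starts):
    """Length-ell reads of x at the given start positions (hypothetical helper)."""
    return [x[s:s + ell] for s in starts]


def min_overlap(ell, starts):
    """Smallest overlap, in symbols, between consecutive reads of length ell."""
    return min(ell - (b - a) for a, b in zip(starts, starts[1:]))


# Two different strings with identical multisets of length-2 substrings,
# so they cannot be told apart from these reads alone.
ambiguous = all_substrings("0110", 2) == all_substrings("1011", 2)

# Intermediate case: consecutive reads of length 4 that overlap in 2 symbols.
reads = reads_with_overlap("ACGTACGT", 4, [0, 2, 4])
overlap = min_overlap(4, [0, 2, 4])

# No-overlap extreme: consecutive reads share no symbols.
no_overlap = min_overlap(4, [0, 4])
```

Here `ambiguous` is `True`, `reads` is `["ACGT", "GTAC", "ACGT"]`, `overlap` is 2, and `no_overlap` is 0. A reconstruction code in the sense of the abstract would restrict the allowed strings so that collisions like the first one cannot occur.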

Original language	English
Pages (from-to)	5648-5659
Number of pages	12
Journal	IEEE Transactions on Information Theory
Volume	69
Issue number	9
DOIs	https://doi.org/10.1109/TIT.2023.3269124
State	Published - 1 Sep 2023

Keywords

DNA sequences
Sequence reconstruction
error correction codes
worst-case analysis

ASJC Scopus subject areas

Information Systems
Library and Information Sciences
Computer Science Applications

Access to Document

10.1109/TIT.2023.3269124

Cite this

@article{1db80e618bef43d3997f1dd6d43a9526,

title = "Generalized Unique Reconstruction From Substrings",

abstract = "This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which all substrings of pre-defined lengths are read or substrings are read with no overlap for the single string case, this work studies two extensions of this paradigm. The first extension considers the setup in which consecutive substrings are read with some given minimum overlap. First, an upper bound is provided on the attainable rates of codes that guarantee unique reconstruction. Then, efficient constructions of codes that asymptotically meet that upper bound are presented. In the second extension, we study the setup where multiple strings are reconstructed together. Given the number of strings and their length, we first derive a lower bound on the read substrings' length ℓ that is necessary for the existence of multi-strand reconstruction codes with non-vanishing rates. We then present two constructions of such codes and show that their rates approach 1 for values of ℓ that asymptotically behave like the lower bound.",

keywords = "DNA sequences, Sequence reconstruction, error correction codes, worst-case analysis",

author = "Yonatan Yehezkeally and Daniella Bar-Lev and Sagi Marcovich and Eitan Yaakobi",

journal = "IEEE Transactions on Information Theory",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.",

year = "2023",

month = sep,

day = "1",

doi = "10.1109/TIT.2023.3269124",

language = "English",

volume = "69",

pages = "5648--5659",

number = "9",

}

TY - JOUR

T1 - Generalized Unique Reconstruction From Substrings

AU - Yehezkeally, Yonatan

AU - Bar-Lev, Daniella

AU - Marcovich, Sagi

AU - Yaakobi, Eitan

PY - 2023/9/1

Y1 - 2023/9/1

N2 - This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous works considered two extreme cases in which all substrings of pre-defined lengths are read or substrings are read with no overlap for the single string case, this work studies two extensions of this paradigm. The first extension considers the setup in which consecutive substrings are read with some given minimum overlap. First, an upper bound is provided on the attainable rates of codes that guarantee unique reconstruction. Then, efficient constructions of codes that asymptotically meet that upper bound are presented. In the second extension, we study the setup where multiple strings are reconstructed together. Given the number of strings and their length, we first derive a lower bound on the read substrings' length ℓ that is necessary for the existence of multi-strand reconstruction codes with non-vanishing rates. We then present two constructions of such codes and show that their rates approach 1 for values of ℓ that asymptotically behave like the lower bound.

KW - DNA sequences

KW - Sequence reconstruction

KW - error correction codes

KW - worst-case analysis

UR - http://www.scopus.com/inward/record.url?scp=85153797205&partnerID=8YFLogxK

U2 - 10.1109/TIT.2023.3269124

DO - 10.1109/TIT.2023.3269124

M3 - Article

AN - SCOPUS:85153797205

SN - 0018-9448

VL - 69

SP - 5648

EP - 5659

JO - IEEE Transactions on Information Theory

JF - IEEE Transactions on Information Theory

IS - 9

ER -
