Statistical Analysis of String-Counts in Genetics

This talk gives an introductory discussion of the probabilistic and statistical analysis of “string-counts” arising in fields such as genetic analysis.  String-counts arise when we have a random “text” composed of “symbols” (e.g., genetic nucleotides) and we count the number of occurrences of a particular string within this text.  If the underlying text is random then the string-count for a particular string is also random, and one can examine its probabilistic behaviour.  In a genetic context, this type of analysis can be used to assess whether a particular string (e.g., a gene) occurs more often than would be expected “at random”.
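As a rough illustration of the idea, the sketch below counts occurrences of a string in a simulated random text and compares the count with its expected value.  It is not taken from the talk: it assumes the simplest possible model, in which symbols are independent and uniformly distributed over the four nucleotides, and the function names, the pattern "GATTACA" and the text length are hypothetical choices made only for this example.

```python
import random


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlapping matches."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Step forward by one symbol so overlapping matches are counted too.
        start = text.find(pattern, start + 1)
    return count


def expected_count(text_length, pattern_length, alphabet_size=4):
    """Expected string-count under the ASSUMED model of independent,
    uniformly distributed symbols: (n - m + 1) / k^m."""
    positions = max(text_length - pattern_length + 1, 0)
    return positions / alphabet_size ** pattern_length


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is reproducible
    text = "".join(rng.choice("ACGT") for _ in range(10_000))
    pattern = "GATTACA"  # hypothetical string of interest
    observed = count_occurrences(text, pattern)
    expected = expected_count(len(text), len(pattern))
    print(f"observed = {observed}, expected = {expected:.3f}")
```

Comparing the observed count with the expected count is only a first step.  Judging whether a count is unusually large or small also requires the spread of the string-count distribution, which depends on how the string can overlap with itself.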

The talk will be pitched primarily at an audience without specialised knowledge of probability and statistics.  No knowledge of probability or textual analysis will be assumed.  We will explain a few of the mathematical concepts and techniques for this problem in a gentle and simplified manner, with a few technical titbits thrown in to hold the attention of any statisticians in the audience.  Though the talk will be pitched at an introductory level, it should provide a general understanding of how statisticians go about determining whether a string-count for a specified string is large or small relative to what would be expected “at random”.

About Ben O’Neill

Dr O’Neill is a statistician and data scientist specialising in experimental design and statistical modelling.  He has broad research interests across statistical theory and modelling, causal analysis and experimental design.  He has expertise in statistical programming in R, and is a regular contributor to the statistical Q&A website CrossValidated.  He previously worked as a Lecturer in Statistics at UNSW, and has completed a number of consulting projects for government and industry bodies.