astrobase.fakelcs.recovery module

This is a companion module for fakelcs/generation.py. It runs LCs generated using functions in that module through variable star detection and classification to see how well they are recovered.

astrobase.fakelcs.recovery.read_fakelc(fakelcfile)[source]

This just reads a pickled fake LC.

Parameters: fakelcfile (str) – The fake LC file to read.
Returns: This returns an lcdict.
Return type: dict

astrobase.fakelcs.recovery.get_varfeatures(simbasedir, mindet=1000, nworkers=None)[source]

This runs lcproc.lcvfeatures.parallel_varfeatures on fake LCs in simbasedir.

Parameters:
  • simbasedir (str) – The directory containing the fake LCs to process.
  • mindet (int) – The minimum number of detections needed to accept an LC and process it.
  • nworkers (int or None) – The number of parallel workers to use when extracting variability features from the input light curves.
Returns:

The path to the varfeatures pickle created after running the lcproc.lcvfeatures.parallel_varfeatures function.

Return type:

str

astrobase.fakelcs.recovery.precision(ntp, nfp)[source]

This calculates precision.

https://en.wikipedia.org/wiki/Precision_and_recall

Parameters:
  • ntp (int) – The number of true positives.
  • nfp (int) – The number of false positives.
Returns:

The precision calculated using ntp/(ntp + nfp).

Return type:

float

astrobase.fakelcs.recovery.recall(ntp, nfn)[source]

This calculates recall.

https://en.wikipedia.org/wiki/Precision_and_recall

Parameters:
  • ntp (int) – The number of true positives.
  • nfn (int) – The number of false negatives.
Returns:

The recall calculated using ntp/(ntp + nfn).

Return type:

float
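
The two ratios above can be sketched in a few lines. This is an illustrative re-implementation of the stated formulas, not the module's source; the function names and the choice to return NaN for a zero denominator are assumptions made for this example.

```python
def precision_sketch(ntp, nfp):
    """Fraction of objects flagged as variable that really are variable."""
    flagged = ntp + nfp
    # Assumption: return NaN when nothing was flagged (the ratio is undefined).
    return ntp / flagged if flagged > 0 else float("nan")


def recall_sketch(ntp, nfn):
    """Fraction of the truly variable objects that were flagged."""
    actual = ntp + nfn
    # Assumption: return NaN when there are no true variables at all.
    return ntp / actual if actual > 0 else float("nan")


# Toy counts: 80 true positives, 20 false positives, 40 false negatives.
example_precision = precision_sketch(80, 20)  # 80 / 100
example_recall = recall_sketch(80, 40)        # 80 / 120
```

Precision drops when many non-variables are flagged, while recall drops when many injected variables are missed, so the two are usually reported together.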

astrobase.fakelcs.recovery.matthews_correl_coeff(ntp, ntn, nfp, nfn)[source]

This calculates the Matthews correlation coefficient.

https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

Parameters:
  • ntp (int) – The number of true positives.
  • ntn (int) – The number of true negatives.
  • nfp (int) – The number of false positives.
  • nfn (int) – The number of false negatives.
Returns:

The Matthews correlation coefficient.

Return type:

float
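
For reference, the standard definition of the coefficient from the four confusion-matrix counts can be sketched as follows. The function name and the NaN return for a zero denominator are assumptions of this example, not confirmed behavior of the module.

```python
import math


def mcc_sketch(ntp, ntn, nfp, nfn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = ntp * ntn - nfp * nfn
    denominator = math.sqrt(
        (ntp + nfp) * (ntp + nfn) * (ntn + nfp) * (ntn + nfn)
    )
    # Assumption: the coefficient is undefined (NaN) if any marginal is zero.
    if denominator == 0:
        return float("nan")
    return numerator / denominator


perfect = mcc_sketch(50, 50, 0, 0)    # every object classified correctly
inverted = mcc_sketch(0, 0, 50, 50)   # every object classified wrongly
chance = mcc_sketch(25, 25, 25, 25)   # no better than random
```

The coefficient ranges from -1.0 to +1.0 and uses all four counts, which makes it less sensitive than precision or recall alone to the imbalance between variables and nonvariables.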

astrobase.fakelcs.recovery.get_recovered_variables_for_magbin(simbasedir, magbinmedian, stetson_stdev_min=2.0, inveta_stdev_min=2.0, iqr_stdev_min=2.0, statsonly=True)[source]

This runs variability selection for the given magbinmedian.

To generate a full recovery matrix over all magnitude bins, run this function for each magbin over the grid of stetson_stdev_min, inveta_stdev_min, and iqr_stdev_min values.

Parameters:
  • simbasedir (str) – The input directory of fake LCs.
  • magbinmedian (float) – The magbin to run the variable recovery for. This is an item from the fakelcinfo['magrms'][magcol] list for each magcol in the dict loaded from simbasedir/fakelcs-info.pkl, and designates which magbin to get the recovery stats for.
  • stetson_stdev_min (float) – The minimum sigma above the trend in the Stetson J variability index distribution for this magbin to use to consider objects as variable.
  • inveta_stdev_min (float) – The minimum sigma above the trend in the 1/eta variability index distribution for this magbin to use to consider objects as variable.
  • iqr_stdev_min (float) – The minimum sigma above the trend in the IQR variability index distribution for this magbin to use to consider objects as variable.
  • statsonly (bool) – If this is True, only the final stats will be returned. If False, the full arrays used to generate the stats will also be returned.
Returns:

The returned dict contains statistics for this magbin and if requested, the full arrays used to calculate the statistics.

Return type:

dict
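
The thresholding idea behind the *_stdev_min parameters can be sketched as below. This is a simplified illustration only: it assumes the "trend" within a single magbin is the median of the index values and that the scatter is their population standard deviation, which may differ from what the module actually computes. All names here are hypothetical.

```python
import statistics


def flag_variables_sketch(index_values, stdev_min=2.0):
    """Flag objects whose index lies more than stdev_min sigma above the trend."""
    trend = statistics.median(index_values)       # assumed trend estimate
    scatter = statistics.pstdev(index_values)     # assumed scatter estimate
    threshold = trend + stdev_min * scatter
    return [value > threshold for value in index_values]


def confusion_counts(flagged, is_variable):
    """Return (ntp, ntn, nfp, nfn) for flagged vs. injected-variable labels."""
    pairs = list(zip(flagged, is_variable))
    ntp = sum(1 for f, v in pairs if f and v)
    ntn = sum(1 for f, v in pairs if not f and not v)
    nfp = sum(1 for f, v in pairs if f and not v)
    nfn = sum(1 for f, v in pairs if not f and v)
    return ntp, ntn, nfp, nfn


# Toy magbin: ten objects, two of which were injected as variables.
stetson = [0.1, 0.0, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 5.0, 0.02]
is_variable = [False] * 6 + [True, False, True, False]

flagged = flag_variables_sketch(stetson, stdev_min=2.0)
counts = confusion_counts(flagged, is_variable)
```

In this toy case only the object with index 5.0 clears the threshold, so one injected variable is recovered and the low-amplitude one is counted as a false negative.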

astrobase.fakelcs.recovery.magbin_varind_gridsearch_worker(task)[source]

This is a parallel grid search worker for the function below.

astrobase.fakelcs.recovery.variable_index_gridsearch_magbin(simbasedir, stetson_stdev_range=(1.0, 20.0), inveta_stdev_range=(1.0, 20.0), iqr_stdev_range=(1.0, 20.0), ngridpoints=32, ngridworkers=None)[source]

This runs a variable index grid search per magbin.

For each magbin, this does a grid search using the Stetson J, 1/eta, and IQR ranges provided and tries to optimize the Matthews Correlation Coefficient (best value is +1.0), which indicates the best possible separation of variables vs. nonvariables. The thresholds on these variability indices that produce the largest coefficient for the collection of fake LCs will probably be the ones that work best for actual variable classification on the real LCs.

https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

For each grid-point, calculates the true positives, false positives, true negatives, false negatives. Then gets the precision and recall, confusion matrix, and the ROC curve for variable vs. nonvariable.

Once we’ve identified the best thresholds to use, we can then calculate variable object numbers:

  • as a function of magnitude
  • as a function of period
  • as a function of number of detections
  • as a function of amplitude of variability

Writes everything back to simbasedir/fakevar-recovery.pkl. Use the plotting function below to make plots for the results.

Parameters:
  • simbasedir (str) – The directory where the fake LCs are located.
  • stetson_stdev_range (sequence of 2 floats) – The min and max values of the Stetson J variability index to search over. A grid is generated between these values and tested to find the value of this index that produces the ‘best’ recovery rate for the injected variable stars.
  • inveta_stdev_range (sequence of 2 floats) – The min and max values of the 1/eta variability index to search over. A grid is generated between these values and tested to find the value of this index that produces the ‘best’ recovery rate for the injected variable stars.
  • iqr_stdev_range (sequence of 2 floats) – The min and max values of the IQR variability index to search over. A grid is generated between these values and tested to find the value of this index that produces the ‘best’ recovery rate for the injected variable stars.
  • ngridpoints (int) –

    The number of grid points for each variability index grid. Remember that this function will be searching in 3D and will require lots of time to run if ngridpoints is too large.

    For the default number of grid points and 25000 simulated light curves, this takes about 3 days to run on a 40 (effective) core machine with 2 x Xeon E5-2650v3 CPUs.

  • ngridworkers (int or None) – The number of parallel grid search workers that will be launched.
Returns:

The returned dict contains a list of recovery stats for each magbin and each grid point in the variability index grids that were used. This dict can be passed to the plotting function below to plot the results.

Return type:

dict
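
A toy version of the per-magbin grid search might look like the following. It is a sketch under several assumptions: the grid is evenly spaced between the range endpoints, each object's indices are already expressed in sigma above the trend, and an object is flagged only when all three indices exceed their thresholds. The real function may combine or score the indices differently; every name below is hypothetical.

```python
import itertools
import math


def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (num >= 2)."""
    return [start + (stop - start) * i / (num - 1) for i in range(num)]


def mcc(ntp, ntn, nfp, nfn):
    denom = math.sqrt((ntp + nfp) * (ntp + nfn) * (ntn + nfp) * (ntn + nfn))
    return (ntp * ntn - nfp * nfn) / denom if denom > 0 else float("nan")


def gridsearch_sketch(objects, stdev_range=(1.0, 20.0), ngridpoints=4):
    """objects: list of (stetson, inveta, iqr, is_variable) tuples."""
    axis = linspace(stdev_range[0], stdev_range[1], ngridpoints)
    best_point, best_mcc, npoints = None, -2.0, 0
    # A 3D search visits ngridpoints**3 threshold combinations.
    for point in itertools.product(axis, axis, axis):
        npoints += 1
        ntp = ntn = nfp = nfn = 0
        for stetson, inveta, iqr, is_variable in objects:
            # Assumed rule: all three indices must exceed their thresholds.
            flagged = all(
                s > t for s, t in zip((stetson, inveta, iqr), point)
            )
            if flagged and is_variable:
                ntp += 1
            elif flagged:
                nfp += 1
            elif is_variable:
                nfn += 1
            else:
                ntn += 1
        score = mcc(ntp, ntn, nfp, nfn)
        if score > best_mcc:  # NaN scores never compare greater, so are skipped
            best_point, best_mcc = point, score
    return best_point, best_mcc, npoints


toy_objects = [
    (9.0, 8.0, 10.0, True),
    (12.0, 15.0, 9.0, True),
    (0.5, 0.2, 0.1, False),
    (2.0, 3.0, 1.5, False),
    (6.0, 0.5, 0.3, False),
]
best_point, best_mcc, npoints = gridsearch_sketch(toy_objects)
```

Because the number of grid points grows as the cube of ngridpoints, the default of 32 gives 32768 threshold combinations per magbin, which is consistent with the long run time quoted above.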

astrobase.fakelcs.recovery.plot_varind_gridsearch_magbin_results(gridsearch_results)[source]

This plots the gridsearch results from variable_index_gridsearch_magbin.

Parameters: gridsearch_results (dict) – This is the dict produced by variable_index_gridsearch_magbin above.
Returns: The returned dict contains filenames of the recovery rate plots made for each variability index. These include plots of the precision, recall, and Matthews Correlation Coefficient over each magbin and a heatmap of these values over the grid points of the variability index stdev values arrays used.
Return type: dict

astrobase.fakelcs.recovery.run_periodfinding(simbasedir, pfmethods=('gls', 'pdm', 'bls'), pfkwargs=({}, {}, {'startp': 1.0, 'maxtransitduration': 0.3}), getblssnr=False, sigclip=5.0, nperiodworkers=10, ncontrolworkers=4, liststartindex=None, listmaxobjects=None)[source]

This runs periodfinding using several period-finders on a collection of fake LCs.

As a rough benchmark, a single invocation of this function on 25000 fake LCs with 10000–50000 points per LC takes about 26 days in total using GLS+PDM+BLS with 10 periodworkers and 4 controlworkers (so all 40 ‘cores’) on a 2 x Xeon E5-2660v3 machine.

Parameters:
  • pfmethods (sequence of str) – This is used to specify which periodfinders to run. These must be in the lcproc.periodsearch.PFMETHODS dict.
  • pfkwargs (sequence of dict) – This is used to provide optional kwargs to the period-finders.
  • getblssnr (bool) – If this is True, will run BLS SNR calculations for each object and magcol. This takes a while to run, so it’s disabled (False) by default.
  • sigclip (float or int or sequence of two floats/ints or None) –

    If a single float or int, a symmetric sigma-clip will be performed using the number provided as the sigma-multiplier to cut out from the input time-series.

    If a list of two ints/floats is provided, the function will perform an ‘asymmetric’ sigma-clip. The first element in this list is the sigma value to use for fainter flux/mag values; the second element in this list is the sigma value to use for brighter flux/mag values. For example, sigclip=[10., 3.], will sigclip out greater than 10-sigma dimmings and greater than 3-sigma brightenings. Here the meaning of “dimming” and “brightening” is set by physics (not the magnitude system), which is why the magsarefluxes kwarg must be correctly set.

    If sigclip is None, no sigma-clipping will be performed, and the time-series (with non-finite elems removed) will be passed through to the output.

  • nperiodworkers (int) – This is the number of parallel period-finding worker processes to use.
  • ncontrolworkers (int) – This is the number of parallel period-finding control workers to use. Each control worker will launch nperiodworkers worker processes.
  • liststartindex (int) – The starting index of processing. This refers to the filename list generated by running glob.glob on the fake LCs in simbasedir.
  • listmaxobjects (int) – The maximum number of objects to process in this run. Use this with liststartindex to effectively distribute working on a large list of input light curves over several sessions or machines.
Returns:

The path to the output summary pickle produced by lcproc.periodsearch.parallel_pf

Return type:

str
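
The three sigclip modes described above can be illustrated with a small stand-alone sketch. It assumes the clip is taken about the median using a robust scatter estimate of 1.483 times the median absolute deviation; that choice, the function name, and the explicit magsarefluxes argument are assumptions of this example rather than a description of the library's code.

```python
import math
import statistics


def sigclip_sketch(mags, sigclip=5.0, magsarefluxes=False):
    """Return the values that survive the requested sigma-clip."""
    # Non-finite elements are always dropped, even when sigclip is None.
    finite = [m for m in mags if math.isfinite(m)]
    if sigclip is None:
        return finite

    median = statistics.median(finite)
    mad = statistics.median([abs(m - median) for m in finite])
    stdev = 1.483 * mad  # assumed robust scatter estimate

    if isinstance(sigclip, (int, float)):
        # Symmetric clip: same limit on both sides of the median.
        return [m for m in finite if abs(m - median) < sigclip * stdev]

    dim_sigma, bright_sigma = sigclip
    kept = []
    for m in finite:
        # Positive "dimming" means fainter: larger mag, or smaller flux.
        dimming = (median - m) if magsarefluxes else (m - median)
        if -bright_sigma * stdev < dimming < dim_sigma * stdev:
            kept.append(m)
    return kept


mags = [10.00, 10.01, 9.99, 10.02, 9.98, 10.00, 10.01, 9.99,
        10.50, 9.80, float("nan")]

unclipped = sigclip_sketch(mags, sigclip=None)
symmetric = sigclip_sketch(mags, sigclip=5.0)
asymmetric = sigclip_sketch(mags, sigclip=[50.0, 5.0])
```

With the asymmetric setting, the deep dimming at 10.50 mag survives while the brightening at 9.80 mag is clipped, which is the usual choice when searching for transit-like or eclipse-like dips.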

astrobase.fakelcs.recovery.check_periodrec_alias(actualperiod, recoveredperiod, tolerance=0.001)[source]

This determines what kind of aliasing (if any) exists between recoveredperiod and actualperiod.

Parameters:
  • actualperiod (float) – The actual period of the object.
  • recoveredperiod (float) – The recovered period of the object.
  • tolerance (float) – The absolute difference required between the input periods to mark the recovered period as close to the actual period.
Returns:

The type of alias determined for the input combination of periods. This will be a CSV string with values taken from the following list, based on the types of alias found:

['actual',
 'twice',
 'half',
 'ratio_over_1plus',
 'ratio_over_1minus',
 'ratio_over_1plus_twice',
 'ratio_over_1minus_twice',
 'ratio_over_1plus_thrice',
 'ratio_over_1minus_thrice',
 'ratio_over_minus1',
 'ratio_over_twice_minus1']

Return type:

str
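
As an illustration of how such a CSV string could be built, the sketch below compares a recovered period against each alias of the actual period, using the alias formulas listed in the documentation of plot_periodicvar_recovery_results. It is not the module's implementation: the function name, the skipping of aliases with a zero denominator, and the 'other' placeholder returned when nothing matches are all assumptions of this example.

```python
import math

# (alias name, formula giving the alias period from the actual period)
ALIAS_FORMULAS = [
    ("actual", lambda p: p),
    ("twice", lambda p: 2.0 * p),
    ("half", lambda p: 0.5 * p),
    ("ratio_over_1plus", lambda p: p / (1.0 + p)),
    ("ratio_over_1minus", lambda p: p / (1.0 - p)),
    ("ratio_over_1plus_twice", lambda p: p / (1.0 + 2.0 * p)),
    ("ratio_over_1minus_twice", lambda p: p / (1.0 - 2.0 * p)),
    ("ratio_over_1plus_thrice", lambda p: p / (1.0 + 3.0 * p)),
    ("ratio_over_1minus_thrice", lambda p: p / (1.0 - 3.0 * p)),
    ("ratio_over_minus1", lambda p: p / (p - 1.0)),
    ("ratio_over_twice_minus1", lambda p: p / (2.0 * p - 1.0)),
]


def alias_sketch(actualperiod, recoveredperiod, tolerance=0.001):
    """Return a CSV string naming every alias within tolerance."""
    matches = []
    for name, formula in ALIAS_FORMULAS:
        try:
            alias_period = formula(actualperiod)
        except ZeroDivisionError:
            # Assumption: an undefined alias simply cannot match.
            continue
        # Absolute (not relative) tolerance, as described for `tolerance`.
        if math.isclose(recoveredperiod, alias_period,
                        rel_tol=0.0, abs_tol=tolerance):
            matches.append(name)
    # Assumption: 'other' is a placeholder for "no alias matched".
    return ",".join(matches) if matches else "other"


exact = alias_sketch(3.0, 3.0)
double = alias_sketch(3.0, 6.0)
ambiguous = alias_sketch(3.0, 1.5)
```

The last call shows why the result is a CSV string: for an actual period of 3.0 days, a recovered period of 1.5 days matches both the 'half' alias and the 'ratio_over_minus1' alias.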

astrobase.fakelcs.recovery.periodicvar_recovery(fakepfpkl, simbasedir, period_tolerance=0.001)[source]

Recovers the periodic variable status/info for the simulated PF result.

  • Uses simbasedir and the lcfbasename stored in fakepfpkl to figure out where the LC for this object is.
  • Gets the actual_varparams, actual_varperiod, actual_vartype, actual_varamplitude elements from the LC.
  • Figures out if the current objectid is a periodic variable (using actual_vartype).
  • If it is a periodic variable, gets the canonical period assigned to it.
  • Checks if the period was recovered in any of the five best periods reported by any of the period-finders, and checks if the recovered period was a harmonic of the actual period.
  • Returns the objectid, actual period and vartype, recovered period, and recovery status.
Parameters:
  • fakepfpkl (str) – This is a periodfinding-<objectid>.pkl[.gz] file produced in the simbasedir/periodfinding subdirectory after run_periodfinding above is done.
  • simbasedir (str) – The base directory where all of the fake LCs and period-finding results are.
  • period_tolerance (float) – The maximum difference that this function will consider between an actual period (or its aliases) and a recovered period to consider it as a ‘recovered’ period.
Returns:

Returns a dict of period-recovery results.

Return type:

dict

astrobase.fakelcs.recovery.periodrec_worker(task)[source]

This is a parallel worker for running period-recovery.

Parameters: task (tuple) –

This is used to pass args to the periodicvar_recovery function:

task[0] = period-finding result pickle to work on
task[1] = simbasedir
task[2] = period_tolerance
Returns: This is the dict produced by the periodicvar_recovery function for the input period-finding result pickle.
Return type: dict

astrobase.fakelcs.recovery.parallel_periodicvar_recovery(simbasedir, period_tolerance=0.001, liststartind=None, listmaxobjects=None, nworkers=None)[source]

This is a parallel driver for periodicvar_recovery.

Parameters:
  • simbasedir (str) – The base directory where all of the fake LCs and period-finding results are.
  • period_tolerance (float) – The maximum difference that this function will consider between an actual period (or its aliases) and a recovered period to consider it as a ‘recovered’ period.
  • liststartind (int) – The starting index of processing. This refers to the filename list generated by running glob.glob on the period-finding result pickles in simbasedir/periodfinding.
  • listmaxobjects (int) – The maximum number of objects to process in this run. Use this with liststartind to effectively distribute working on a large list of input period-finding result pickles over several sessions or machines.
  • nworkers (int or None) – This is the number of parallel worker processes to use.
Returns:

Returns the filename of the pickle produced containing all of the period recovery results.

Return type:

str
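
The start-index and listmaxobjects kwargs amount to slicing the list of input files, so a large run can be split across sessions or machines. A minimal sketch of that slicing, with a hypothetical helper name and made-up filenames, is:

```python
def select_chunk(filelist, start=None, maxobjects=None):
    """Return the part of filelist that one session should process."""
    if start is not None:
        filelist = filelist[start:]
    if maxobjects is not None:
        filelist = filelist[:maxobjects]
    return filelist


# Hypothetical, already-sorted list of period-finding result pickles.
all_pickles = ["periodfinding-obj%04d.pkl" % i for i in range(10)]

session_one = select_chunk(all_pickles, start=0, maxobjects=4)
session_two = select_chunk(all_pickles, start=4, maxobjects=4)
session_three = select_chunk(all_pickles, start=8, maxobjects=4)
```

Each session must see the file list in the same order for the chunks to be disjoint, so it is safest to run all sessions against the same unchanged directory.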

astrobase.fakelcs.recovery.plot_periodicvar_recovery_results(precvar_results, aliases_count_as_recovered=None, magbins=None, periodbins=None, amplitudebins=None, ndetbins=None, minbinsize=1, plotfile_ext='png')[source]

This plots the results of periodic var recovery.

This function makes plots for periodicvar recovered fraction as a function of:

  • magbin
  • periodbin
  • amplitude of variability
  • ndet

with plot lines broken down by:

  • magcol
  • periodfinder
  • vartype
  • recovery status

The kwargs magbins, periodbins, amplitudebins, and ndetbins can be used to set the bin lists as needed. The kwarg minbinsize controls how many elements per bin are required to accept a bin in processing its recovery characteristics for mags, periods, amplitudes, and ndets.

Parameters:
  • precvar_results (dict or str) – This is either a dict returned by parallel_periodicvar_recovery or the pickle created by that function.
  • aliases_count_as_recovered (list of str or 'all') –

    This is used to set which kinds of aliases this function considers as ‘recovered’ objects. Normally, we require that recovered objects have a recovery status of ‘actual’ to indicate the actual period was recovered. To change this default behavior, aliases_count_as_recovered can be set to a list of alias status strings that should be considered as ‘recovered’ objects as well. Choose from the following alias types:

    'twice'                    recovered_p = 2.0*actual_p
    'half'                     recovered_p = 0.5*actual_p
    'ratio_over_1plus'         recovered_p = actual_p/(1.0+actual_p)
    'ratio_over_1minus'        recovered_p = actual_p/(1.0-actual_p)
    'ratio_over_1plus_twice'   recovered_p = actual_p/(1.0+2.0*actual_p)
    'ratio_over_1minus_twice'  recovered_p = actual_p/(1.0-2.0*actual_p)
    'ratio_over_1plus_thrice'  recovered_p = actual_p/(1.0+3.0*actual_p)
    'ratio_over_1minus_thrice' recovered_p = actual_p/(1.0-3.0*actual_p)
    'ratio_over_minus1'        recovered_p = actual_p/(actual_p - 1.0)
    'ratio_over_twice_minus1'  recovered_p = actual_p/(2.0*actual_p - 1.0)
    

    or set aliases_count_as_recovered='all' to include all of the above in the ‘recovered’ periodic var list.

  • magbins (np.array) – The magnitude bins to plot the recovery rate results over. If None, the default mag bins will be used: np.arange(8.0,16.25,0.25).
  • periodbins (np.array) – The period bins to plot the recovery rate results over. If None, the default period bins will be used: np.arange(0.0,500.0,0.5).
  • amplitudebins (np.array) – The variability amplitude bins to plot the recovery rate results over. If None, the default amplitude bins will be used: np.arange(0.0,2.0,0.05).
  • ndetbins (np.array) – The ndet bins to plot the recovery rate results over. If None, the default ndet bins will be used: np.arange(0.0,60000.0,1000.0).
  • minbinsize (int) – The minimum number of objects per bin required to plot a bin and its recovery fraction on the plot.
  • plotfile_ext ({'png','pdf'}) – Sets the plot output files’ extension.
Returns:

A dict containing recovery fraction statistics and the paths to each of the plots made.

Return type:

dict
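
The interaction between aliases_count_as_recovered, the bin arrays, and minbinsize can be sketched as follows. This is an illustrative calculation only: the half-open binning convention, the function name, and the input layout (a value plus a CSV recovery-status string per object) are assumptions, not the structure of the actual results dict.

```python
ALIAS_TYPES = [
    "twice", "half",
    "ratio_over_1plus", "ratio_over_1minus",
    "ratio_over_1plus_twice", "ratio_over_1minus_twice",
    "ratio_over_1plus_thrice", "ratio_over_1minus_thrice",
    "ratio_over_minus1", "ratio_over_twice_minus1",
]


def recovered_fraction_sketch(objects, binedges,
                              aliases_count_as_recovered=None,
                              minbinsize=1):
    """objects: list of (value, status_csv) for actual periodic variables."""
    accepted = {"actual"}
    if aliases_count_as_recovered == "all":
        accepted.update(ALIAS_TYPES)
    elif aliases_count_as_recovered:
        accepted.update(aliases_count_as_recovered)

    results = []
    for low, high in zip(binedges[:-1], binedges[1:]):
        # Assumed convention: bins are half-open, [low, high).
        statuses = [s for value, s in objects if low <= value < high]
        if len(statuses) < minbinsize:
            continue  # too few objects to report a fraction for this bin
        nrecovered = sum(
            1 for s in statuses if accepted & set(s.split(","))
        )
        results.append((low, high, nrecovered / len(statuses)))
    return results


toy = [
    (8.1, "actual"), (8.2, "twice"), (8.3, "other"),
    (8.4, "half,ratio_over_minus1"), (8.6, "actual"), (8.7, "other"),
]
magbins = [8.0, 8.5, 9.0, 9.5]

strict = recovered_fraction_sketch(toy, magbins)
with_twice = recovered_fraction_sketch(toy, magbins, ["twice"])
with_all = recovered_fraction_sketch(toy, magbins, "all")
```

In this toy data the first magnitude bin goes from a recovered fraction of 0.25 with the strict definition to 0.75 when all aliases count, and the empty third bin is dropped because it falls below minbinsize.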